Although Arm processors are still primarily viewed as targeting the embedded market, and justifiably so, multiple hardware vendors are now using this architecture to build server CPUs and to compete with Intel in the cloud and High Performance Computing (HPC) segments. This broadens the variety of Java applications that run on Arm CPUs and adds to the complexity of the Java Arm port itself, which must support a segmented variety of CPU vendors and workloads.
In this post, which follows my article for Java Magazine, I explore the status of Java and the Java ecosystem on Arm and their evolution. I also discuss recent developments in the features and performance of the Java Arm ports, covering both server and IoT/embedded deployments.
Leaving aside the embedded and mobile markets, where Arm dominates with the 32-bit ARMv5, 6, 7 and 8 ISAs, it is no longer a stretch to say that Arm provides a viable alternative in markets currently dominated by the x86 architecture. Unlike CPU vendors such as Intel, which focus on shipping processors and evolve the x86 architecture to do so, Arm is primarily an architecture design company that sells architectural and core licenses to its customers, who turn them into actual silicon. This allows a great variety of implementations of the same architecture to co-exist and compete in different market segments.
Recent developments in the Arm architecture itself clearly show that the focus has shifted toward enabling competitive Arm-based server CPU designs.
In 2016, Arm finalized a 64-bit and 32-bit capable ARMv8-A ISA, targeting both the embedded and server markets. ARMv8-A architecture, which added support for 64-bit and mandated the presence of NEON SIMD instructions, also introduced optional instructions for AES encryption, SHA-1, SHA-256 and CRC32, which some vendors implement to boost cryptographic and checksum performance.
Arm did not stop there. In 2017, it extended the architecture with the ARMv8.1-A update, most notably adding new atomic instructions. Later, ARMv8.2-A added half-precision floating-point data processing and dot-product SIMD instructions. More importantly, starting with ARMv8.2-A, the optional SVE (Scalable Vector Extension) instructions introduced better support for vectorization than the NEON instruction set, making the ARMv8 architecture much better suited for HPC. Recently, ARMv8.3-A added SIMD complex-number support and weaker release-consistency instructions.
The ARMv8 architecture leaves room for vendor design choices to balance performance, complexity, and power goals. It adopts a relaxed hardware memory model that is weaker than x86-TSO, so more out-of-order effects can be observed. Compared to ARMv7, there are useful new concurrency primitives, including load-acquire and store-release instructions, as well as weaker barrier instructions. But a careful programmer who follows the Java language memory model will not notice these differences, because the JVM hides them inside its implementation.
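To make this concrete, here is a minimal, illustrative sketch of release/acquire publication with VarHandle: on ARMv8 the setRelease/getAcquire pair can compile down to the store-release/load-acquire instructions mentioned above, while on x86 ordinary loads and stores already give the same guarantees — the Java code is identical either way. Class and field names are my own.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class Handoff {
    static final VarHandle READY;
    static int payload;       // published data, written before the release store
    static boolean ready;     // guard flag, accessed only through the VarHandle

    static {
        try {
            READY = MethodHandles.lookup()
                    .findStaticVarHandle(Handoff.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void publish(int value) {
        payload = value;          // plain store
        READY.setRelease(true);   // release store: payload is visible before the flag
    }

    static Integer tryConsume() {
        // acquire load pairs with the release store above
        return (boolean) READY.getAcquire() ? payload : null;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> publish(42));
        writer.start();
        writer.join();
        System.out.println(tryConsume());
    }
}
```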
Several hardware vendors contend with Intel in the server market with their ARMv8-based processor designs, and they should already be taken seriously. Some are new to the Arm server ecosystem, such as Qualcomm with its Centriq 2400 offering, or the relaunched Ampere, which holds assets from APM. Others, like Cavium, recently acquired by Marvell, are already established in the Arm server market and have a track record of delivering ThunderX production systems for several years; they recently released the second-generation ThunderX2 systems. ThunderX systems are readily available from cloud providers such as Packet and Scaleway, and AWS recently launched its A1 instances powered by Graviton processors.
The real competition for ARMv8-based server hardware vendors is not the other ARMv8 vendors; it is Intel that they are all trying to displace. And there is more than just the cloud market for Arm: Cray and HPE are shipping HPC solutions based on the ARMv8 architecture, making the HPC future of ARMv8 real. As part of the Vanguard program, Sandia National Labs is deploying an Arm-based supercomputer with a theoretical peak of more than 2.3 petaflops.
Aside from the CPU core design, ARMv8-based server vendors invest heavily in parallelism and memory bandwidth while keeping power consumption low. For example, the Qualcomm Centriq 2400 platform is said to have 48 single-threaded cores and 6 channels of DDR4 memory per SoC. Marvell ThunderX2 has up to 64 four-threaded cores in a dual-socket configuration (256 hardware threads in total) and 8 channels of DDR4 memory per socket.
In the embedded segment, things are more traditional, and most chip makers license the Cortex-A core from Arm instead of building their own.
Availability of Java on Arm
End users will find a good choice of providers of Java and OpenJDK binaries for Arm.
Both the ARMv7 and ARMv8 Java ports are fully functional, and the sources are available from OpenJDK under the GPLv2 license with the Classpath Exception, which has enabled most Linux distributions to include them in their package repositories.
Sometimes using the OpenJDK binaries provided by a Linux distribution's package manager is not an option: your favorite distribution may not contain the required packages, or you may be looking for commercial support. In that case, there is an excellent choice of Java/OpenJDK binaries provided by AdoptOpenJDK, Azul, BellSoft and Oracle. At the time this article was published, Oracle provided only JDK 8 binaries for ARMv8 and ARMv6/7, Azul provided binaries for JDK 8 and 11, while BellSoft offered binaries for JDK 8, 9, 10 and 11 which, for the Raspberry Pi, include the OpenJFX and Device IO API modules. Azul, BellSoft and, most notably, Oracle provide supported binaries that comply with the Java SE specification and verify their binaries with the JCK test suite.
Features of the Java Arm ports
Java and OpenJDK Arm ports are mature for production use.
The minimum requirement for a Java/OpenJDK implementation is conformance with the Java SE Specification, demonstrated by passing the Java SE Compatibility Test Suite (JCK). The 32-bit Arm and ARMv8 ports reached that level of compatibility long ago and are first-class citizens among Java-supported platforms, along with x86 and SPARC.
While it is very important to ensure compatibility of Java implementations, passing the JCK is not the only requirement for a successful Java port. To meet startup and throughput performance expectations, both the ARMv7 and ARMv8 ports implement the C1 and C2 JIT compilers, allowing them to produce optimized code that takes advantage of the underlying architecture. On top of that, -XX:+TieredCompilation is supported and turned on by default in the Server VM, combining fast startup with C2 throughput. A full set of GCs is supported in both ports: ParallelGC, G1, SerialGC, and CMS (deprecated).
For embedded use cases, the ARMv7 port in some bundles also carries a lightweight Minimal VM. On JDK 9 or higher, this allows building Java runtime images with a low static footprint using the Jigsaw module system. For example, running
OUTPUT=~/out
bin/jlink --module-path jmods --compress=2 --add-modules java.base --output $OUTPUT
rm -r $OUTPUT/lib/client $OUTPUT/lib/server
echo "-minimal KNOWN" > $OUTPUT/lib/jvm.cfg
on BellSoft Arm JDK 10, which provides the Minimal VM, produces a Java runtime containing the java.base module with a static footprint as small as 16 MB. Surprisingly, java.base (perhaps with a few additional modules) is sufficient for quite a number of Java applications targeting constrained IoT gateways. For example, a runtime capable of running Apache Felix or Jetty fits into 32 MB.
Over the years, the ARMv8 port received built-in optimized assembly intrinsics for CPU-intensive operations. At present, only several intrinsics present in the x86 port are absent in the ARMv8 port, and the gap is rapidly closing.
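One such intrinsic is easy to observe from plain Java: java.util.zip.CRC32 is backed by a HotSpot intrinsic, so on ARMv8 parts implementing the optional CRC32 instructions the same library call runs on hardware instructions. A small sketch (class and method names are mine):

```java
import java.util.zip.CRC32;

public class CrcDemo {
    // CRC32.update is intrinsified by HotSpot; on ARMv8 CPUs with the
    // optional CRC32 extension it maps to the hardware instructions.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(crc32("hello".getBytes())));
    }
}
```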
See Table 1 for a detailed comparison of major JVM features across the upstream x86, 64-bit ARMv8 and 32-bit Arm ports.
Table 1. Comparison of major upstream x86 and Arm JVM port features.
| Feature | x86 | ARMv8 (64-bit) | Arm (32-bit) |
|---|---|---|---|
| Minimal VM | Yes (32-bit) | Yes, since JDK 12 | Yes |
| Graal JIT (Experimental) | Yes, since JDK 10 | Yes, since JDK 11 | No |
| CMS | Yes, Deprecated | Yes, Deprecated | Yes, Deprecated |
| AppCDS | Yes | Yes, since JDK 10 | Yes, since JDK 10 |
| Serviceability: Java Flight Recorder | Yes | Yes, since JDK 11 | Yes, since JDK 11 |
Performance of the AArch64 JVM port
Hardware, the OS and the JVM all contribute to the performance of Java applications and benchmarks. Let's dive into the performance of the ARMv8 port, as the server market is where performance matters most. To make a fair comparison, it is important to find equivalent x86 and Arm servers. Luckily, for the recently released Marvell ThunderX2 ARMv8 CPUs, a comparable Intel part is listed for each SKU based on SPECint2017 rates. From this table, I selected the ThunderX2 CN9975 and the comparable single-socket Intel Xeon Gold 6140 system, both equipped with DDR4-2666 memory and running Ubuntu 16.04. Dual-socket systems with these CPUs are also available.
It’s interesting to note that the ThunderX2 CN9975 CPU has 112 threads (a 28-core part with 4-way SMT), while the comparable Intel Xeon Gold 6140 has 36 threads (18 cores with Hyper-Threading).
To assess the performance of the JVM ARMv8 and x86 ports, I used the SPECjbb2015 1.01 and SPECjvm2008 1.01 benchmarks running the OpenJDK 11 EA build 18. All benchmarks were executed 20 times, and mean values were collected. The SPECjbb2015 benchmark was used to obtain an overall score, while the SPECjvm2008 provided additional insights into the performance of ARMv8 HotSpot JVM port.
Since this article is not intended to report the best score obtainable on a specific hardware system, but to study the performance of what a typical end user would see, I intentionally did not fine-tune low level JVM parameters or kernel settings on either system. Check the SPEC scores for the processors as reported by the hardware vendors to compare the highest achievable numbers with JVM options tuning.
The SPECjbb2015 1.01 Composite results (Critical-jOPS and Max-jOPS) are presented in Figure 1. The JVM command line options used for these runs were very common for SPECjbb2015 runs:
-Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:-UseBiasedLocking
for ARMv8 and
-Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:+UseBiasedLocking
for x86. Switching biased locking off for ARMv8 and leaving it on for x86 gave both platforms slightly better results.
Figure 1. SPECjbb2015-Composite performance results on single-socket Xeon Gold 6140 and ThunderX2 CN9975 with DDR4-2666 memory running Ubuntu 16.04. Higher is better.
As can be seen from the results presented, the OpenJDK 11 ARMv8 port running on ThunderX2 outperforms the x86 port on Xeon Gold 6140 by 33% in SPECjbb2015 Max-jOPS and by 16% in SPECjbb2015 Critical-jOPS. Long story short, the ThunderX2 system with the ARMv8 JVM port is well suited for the enterprise workloads represented by the SPECjbb2015 benchmark.
To assess per-thread performance, I also limited the number of CPU threads on the ThunderX2 to match the Intel Xeon Gold 6140, so that it used only 32% of its CPU threads. Unsurprisingly, in this configuration SPECjbb2015 clearly favoured the Xeon Gold, giving it a 30% advantage.
The SPECjvm2008 Base results for the individual benchmarks together with the composite Base score are presented in Figure 2. Since the SPECjvm2008 “compiler” benchmark has not worked since JDK 8, the composite geomean Base score was calculated manually without the “compiler” result.
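For readers who want to reproduce that calculation: the composite score is the geometric mean of the individual benchmark scores. The sketch below shows the arithmetic with made-up sample values, not the measured scores:

```java
public class Geomean {
    // Geometric mean: exp of the average of the logs. SPECjvm2008 uses it
    // for the composite score so that no single benchmark dominates.
    static double geomean(double[] scores) {
        double logSum = 0.0;
        for (double s : scores) {
            logSum += Math.log(s);
        }
        return Math.exp(logSum / scores.length);
    }

    public static void main(String[] args) {
        // Illustrative values only: geomean(100, 200, 400) is exactly 200.
        System.out.println(geomean(new double[] {100.0, 200.0, 400.0}));
    }
}
```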
Figure 2. SPECjvm2008 performance results on single-socket Xeon Gold 6140 and ThunderX2 CN9975 with DDR4-2666 memory running Ubuntu 16.04. Higher is better.
As can be seen from the results presented, the OpenJDK 11 ARMv8 port running on ThunderX2 outperforms the x86 port on Xeon Gold 6140 by 28% in the SPECjvm2008 composite Base score. There are two main reasons for the better overall ARMv8 system score. The first is its higher memory bandwidth (8 DDR4 channels compared to 6 on the Intel system). The second is the work done in the ARMv8 Java port that allows full utilization of the CPU's potential and extensions.
To gain additional insights, let's explore the scores for the individual SPECjvm2008 workloads. In eight out of nine SPECjvm2008 benchmarks the ARMv8 port outperformed x86; in the remaining one, x86 came out ahead.
The Crypto benchmark clearly favors an ARMv8-based system, giving it a 62% advantage, which would not be reachable if the ARMv8 port didn’t fully utilize the AES and SHA extensions available on this chip.
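No architecture-specific Java code is needed to get that speedup: the standard JCE API is enough, because HotSpot intrinsifies the provider's inner AES routines and picks the hardware instructions where available. A minimal round-trip sketch (class and method names are mine):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class AesRoundTrip {
    // Encrypts and then decrypts the plaintext with a fresh AES-128 key.
    // HotSpot intrinsifies the underlying AES block routines, so on ARMv8
    // silicon with the crypto extensions this runs on the hardware AES
    // instructions without any change to the Java code.
    static byte[] roundTrip(byte[] plaintext) {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(128);
            SecretKey key = gen.generateKey();
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] encrypted = cipher.doFinal(plaintext);
            cipher.init(Cipher.DECRYPT_MODE, key);
            return cipher.doFinal(encrypted);
        } catch (java.security.GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] message = "secret".getBytes();
        System.out.println(java.util.Arrays.equals(message, roundTrip(message)));
    }
}
```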
The compress benchmark (where the ARMv8-based system beats Intel by 12%) uses the CRC32C intrinsic. The XML (ARMv8 beats Intel by 29%) and MpegAudio (by 44%) benchmarks use the java.lang.String and java.util.Arrays intrinsics. Some of these intrinsics were recently improved for ARMv8 in JDK 10 and 11 by BellSoft together with Cavium/Marvell.
It is also important to understand the result for the benchmark where the x86 OpenJDK port did better: scimark.small (by 29%). The reason lies in the benchmark code: the FFT, LU, SOR and SPARSE scimark sub-benchmarks all contain heavy loops and matrix-computation code. Over the years, Intel has put a lot of effort into loop unrolling and vectorization, which allows such code sequences to be mapped to AVX instructions on x86. This work has not yet been completed for the ARMv8 C2 port, and the absence of a good equivalent of 512-bit AVX does not help (that gap will close when Arm delivers SVE). On top of that, the FFT sub-benchmark uses java.lang.Math functions (intrinsified for both x86 and ARMv8, the latter since JDK 11), which on ARMv8 use the 128-bit NEON SIMD instructions. With scimark.large this effect is mitigated, because scimark.large computes on a large dataset, which implies memory access, giving the ARMv8-based system a chance to show its wider memory bandwidth.
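The shape of the code in question is easy to picture: tight counted loops over arrays, like the illustrative axpy kernel below, are exactly what C2 maps to AVX on x86 through unrolling and vectorization, and what the ARMv8 port does not yet exploit as fully:

```java
public class Axpy {
    // y[i] += a * x[i]: a tight counted loop over doubles, the kind of
    // kernel that C2 can unroll and auto-vectorize on x86 (AVX) and,
    // to a lesser extent today, on ARMv8 (NEON).
    static void axpy(double a, double[] x, double[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] += a * x[i];
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {10.0, 20.0, 30.0};
        axpy(2.0, x, y);
        System.out.println(java.util.Arrays.toString(y)); // [12.0, 24.0, 36.0]
    }
}
```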
There is definitely some work ahead to bring the ARMv8 port's scientific-workload performance up to par with x86. However, it can already be concluded that for regular server-side Java business workloads (data processing, XML, crypto operations), the OpenJDK 11 ARMv8 port running on ThunderX2 SKUs provides better performance than the x86 equivalent.
Performance diagnostics tools are essential for understanding the bottlenecks of a Java application being developed or run in production.
Regular performance diagnostics through the JMX and JVMTI APIs work on Arm just as they do on x86. For more thorough Java performance analysis, BellSoft also ported AsyncProfiler and HonestProfiler to ARMv8 and contributed the changes back to the projects. This made it possible to improve the performance of an application as complex as Hadoop on ARMv8. Figure 3 shows a subset of the flamegraph from the Hadoop Terasort benchmark, collected from the Hadoop JVMs using AsyncProfiler on ARMv8. If you are working on a complex Java application and would like to profile JVM bottlenecks on Arm (or any other architecture), these are the open-source tools I would recommend.
Figure 3. Flamegraph built with AsyncProfiler data running Hadoop on ARMv8.
JFR, which was open-sourced by Oracle and contributed to OpenJDK 11, was also made available in the Arm ports. Figure 4 shows the method-profiling output from a JFR recording on JDK 11 ARMv8. As usual for JFR, the profiling overhead on ARMv8 was low (1-2%) and still yielded detailed profiling information, which makes it very suitable for production system monitoring.
Figure 4. OpenJDK Java Mission Control view of a JFR profile collected from Map Hadoop task on ARMv8.
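Besides recordings started from the command line, JDK 11 also exposes JFR programmatically through the jdk.jfr API, so a recording like the one above can be driven from the application itself. A minimal sketch (the file name and the workload are arbitrary choices of mine):

```java
import jdk.jfr.Recording;
import java.nio.file.Path;

public class JfrDemo {
    // Starts a JFR recording, runs a small workload, and dumps the
    // recording to a file for later analysis in Java Mission Control.
    static Path record() {
        try (Recording recording = new Recording()) {
            recording.start();
            long sum = 0;                          // workload to profile
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            recording.stop();
            Path out = Path.of("profile.jfr");     // arbitrary output name
            recording.dump(out);
            System.out.println("workload sum = " + sum + ", recording at " + out);
            return out;
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        record();
    }
}
```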
Java Ecosystem on Arm
In theory, all software written in Java should just work on Java on Arm. That said, some big projects include architecture-specific tweaks, such as natively-built libraries (for example, snappy). The following popular projects, though not claiming official support for ARMv8, were tested and work well on Arm: Hadoop 3.1.0, Tomcat 9.0.8, Spark 2.3.0, Kafka 1.1.0, Cassandra 3.11.2, Lucene 7.3.0, Flink 1.4.2.
A number of companies, including Arm, BellSoft, Cavium/Marvell, Linaro, Oracle, Red Hat and others, collaborate in the OpenJDK codebase to ensure the long-term future of the upstream Arm ports. That includes gradual improvements in performance and stability, work on a fully supported Graal VM and on Graal as a JIT compiler on ARMv8, ZGC, and future projects such as Valhalla and Panama. Aside from that, work is underway to ensure good use of the planned SVE instructions.
The upstream ARM 32-bit and ARMv8 Java ports are ready for production use, with all of the relevant features on par with x86.
The 32-bit Arm port provides all the necessary functionality for Embedded & IoT deployments, including C1 for fast startup, a low dynamic memory footprint, and the Minimal VM, which allows producing Java runtime images with a low static footprint (under 16 MB). It works well on popular devices such as the Raspberry Pi, and, after proper device- and application-specific tuning, the 32-bit Arm port can be used in production under the GPL license.
The ARMv8 port, which is aimed primarily at the server market, shows better performance than x86 on equivalent hardware (a 16% advantage in SPECjbb2015 Critical-jOPS, 33% in SPECjbb2015 Max-jOPS, and 28% in the SPECjvm2008 composite Base score). As demonstrated by the SPECjvm2008 benchmarks, for typical server-side Java business applications that process and encrypt data and XML, the OpenJDK 11 ARMv8 port running on ThunderX2 is faster than the Intel SKU counterpart. The reasons why scimark.small performance is lower than on Intel were analysed and should not be too difficult to address.
Overall, the Java software ecosystem is ready for production deployments on Arm. For Embedded & IoT use cases, Arm is already the primary platform of choice, but why would the server market and major cloud providers consider moving to a different architecture if the performance advantage is only tens of percent? The answer is price for performance. Considering the performance of the JVM on Arm and the price of the CPUs, the move starts to make sense. And it is very easy to try, given how little effort it takes to bring an existing Java application to a new architecture, provided there is a fully functional, performant and stable Java.