Saturday, October 20, 2018

Scala Numerical Performance with Scala Native and Graal

This post is a follow-on to my earlier post looking at the performance of different approaches to writing an n-body simulation using Scala. That post focused on the style of Scala code used and how that impacted the performance. This post uses that same code, so you should refer to it to see what the different types of simulations are doing.


Comparing Runtimes

For someone who is interested in doing numerical simulations, Scala and the JVM might seem like odd choices, but the benefits of having a good language that is highly expressive and maintainable can be significant, even for numerical work. In addition to the impact of programming style and the optimizations done by the compiler that produces JVM bytecode, performance can also be significantly impacted by the choice of runtime environment.

Historically, there wasn't much in the way of choice. Sun, then later Oracle, made the only JVM that was in even reasonably broad usage, and enough effort was put into making the HotSpot optimizer work that it was generally a good choice. Today, however, there are a few more options. If you are running Linux, odds are good that you have OpenJDK by default and would have to specifically download and install the Oracle version if you want it. In addition, Oracle has recently been working on Graal, a new virtual environment for both JVM and non-JVM languages. Part of the argument for Graal was that the old C2 HotSpot compiler, written in C++, had simply become too brittle, making it hard to add new optimizations. Graal is being built fresh from the ground up using Java, and many new types of analysis are included in it. While I have seen benchmarks indicating that Graal, though young, is already a faster option for many Scala workloads, I wasn't certain if that would be the case for numerical work. This is at least in part because one of the Graal talks this last summer at Scala Days mentioned that Graal was not yet emitting SIMD instructions for numerical computations.

The newest addition to the list of supported environments for Scala is Scala Native. This project uses LLVM to compile Scala source to native executables. One of the main motivators for this right now is using Scala for batch processing, because native executables don't suffer from the startup time of bringing up the JVM. This project is still in beta, but I wanted to see if it might be able to produce executables with good numerical performance as well.

For these benchmarks, the Scala code was compiled with -opt:_ and run on each JVM with no additional options. I am using a different machine from my earlier post, which explains the significant runtime differences between this post and the earlier one using a similar JVM. The following table gives timing results for the five approaches using the five different runtimes.

| Environment | Style | Average Time [s] | Stdev [s] |
|---|---|---|---|
| Oracle JDK 8-191 | Value Class | 0.394 | 0.012 |
| | Mutable Class | 0.683 | 0.015 |
| | Immutable Class | 0.809 | 0.010 |
| | Functional 1 | 4.246 | 0.439 |
| | Functional 2 | 1.723 | 0.027 |
| Oracle JDK 11 | Value Class | 0.378 | 0.006 |
| | Mutable Class | 0.690 | 0.012 |
| | Immutable Class | 0.940 | 0.083 |
| | Functional 1 | 4.420 | 0.059 |
| | Functional 2 | 1.589 | 0.021 |
| OpenJDK 10 | Value Class | 0.388 | 0.008 |
| | Mutable Class | 0.715 | 0.006 |
| | Immutable Class | 0.892 | 0.013 |
| | Functional 1 | 4.405 | 0.039 |
| | Functional 2 | 1.689 | 0.013 |
| GraalVM 1.0.0-rc7, Java 1.8 | Value Class | 0.377 | 0.003 |
| | Mutable Class | 0.396 | 0.003 |
| | Immutable Class | 0.694 | 0.108 |
| | Functional 1 | 4.054 | 0.151 |
| | Functional 2 | 0.793 | 0.016 |
| Scala Native 0.3.8 | Value Class | 2.603 | 0.185 |
| | Mutable Class | 1.028 | 0.020 |
| | Immutable Class | 2.595 | 0.020 |
| | Functional 1 | 16.84 | 1.39 |
| | Functional 2 | 5.232 | 0.655 |
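
For readers who want to collect similar numbers, averages and standard deviations like those in the table could come from a harness along the following lines. This is only a sketch and not the code used for this post: the names `TimingSketch`, `timeRuns`, and `workload`, the number of runs, the warm-up runs, and the choice of the sample standard deviation are all assumptions.

```scala
object TimingSketch {
  /** Runs `body` `runs` times and returns the wall-clock seconds of each run. */
  def timeRuns(runs: Int)(body: => Unit): Seq[Double] =
    Seq.fill(runs) {
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) * 1e-9
    }

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Sample standard deviation (n - 1 in the denominator); an assumed choice. */
  def stdev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in workload; the real benchmarks ran the n-body simulations.
    def workload(): Unit = {
      var sum = 0.0
      var i = 0
      while (i < 10000000) { sum += math.sqrt(i.toDouble); i += 1 }
      if (sum < 0) println(sum) // uses the result so the loop is not dead code
    }
    timeRuns(3)(workload())              // assumed warm-up runs, discarded
    val times = timeRuns(10)(workload()) // assumed number of measured runs
    println(f"average = ${mean(times)}%.3f s, stdev = ${stdev(times)}%.3f s")
  }
}
```

On a JVM the warm-up runs matter because the JIT compiler needs time to optimize hot code; for a Scala Native executable they would be unnecessary.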

Looking at the first three runtimes, we notice that there is very little difference between Oracle and OpenJDK over various Java versions from 8 to 11. In all three, the value class approach is fastest by far, followed by the version with the mutable classes, then the immutable classes, with the functional approaches being slowest by a fair margin.

Things get more interesting when we look at Graal. The performance of the value class version is roughly the same as for the other JVMs, but every other version runs faster under Graal than in the others, some of them dramatically so. The ordering stays the same as to which approaches are fastest, but the magnitude of how much slower each version is than the value class version changes dramatically. Under Graal, the mutable class version is almost as fast as the value class version instead of being almost a factor of two slower. Most impressive is that the second functional version is only a factor of two slower than the value class version instead of being four times slower. This is significant for two reasons. One is that it is the version written in the most idiomatic Scala style. The other is that this version literally does twice as much work in terms of distance calculations as the other versions. That means that we really can't expect it to do any better than being 2x as slow. The fact that it takes only about twice as long to run means that using Graal there isn't a significant overhead to the functional approach the way there is using the older JVMs.
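
To make the factor-of-two floor concrete, here is a minimal sketch that just counts distance evaluations. It assumes that the non-functional versions visit each unordered pair of bodies once, while the second functional version evaluates the distance for every ordered pair; the actual simulation code is in the earlier post, and the names `PairCountSketch`, `symmetricPairs`, and `orderedPairs` are hypothetical.

```scala
object PairCountSketch {
  /** Distance evaluations when each unordered pair (i, j) with i < j is visited once. */
  def symmetricPairs(n: Int): Long = {
    var count = 0L
    for (i <- 0 until n; j <- i + 1 until n) count += 1
    count
  }

  /** Distance evaluations when every body looks at every other body. */
  def orderedPairs(n: Int): Long = {
    var count = 0L
    for (i <- 0 until n; j <- 0 until n if i != j) count += 1
    count
  }

  def main(args: Array[String]): Unit = {
    val n = 1000 // hypothetical body count
    println(s"symmetric: ${symmetricPairs(n)}, ordered: ${orderedPairs(n)}")
  }
}
```

The first count is n(n-1)/2 and the second is n(n-1), so under these assumptions a runtime that added no overhead at all for the functional style would still take twice as long.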

At the end of the table, we have the results for Scala Native. Unfortunately, it is clear that Scala Native is not yet ready for running performance-critical numerical code. One result that stands out is that the value class version is not the fastest. Indeed, it runs at a speed roughly equal to the immutable class and 2.5x slower than the mutable class. I assume that this means that the value class optimizations have not yet been implemented in Scala Native. Why even the mutable class version is more than 2x slower than Graal and roughly 45 to 50% slower than the other VMs is a bit puzzling to me, as I did the timing using a release build. I expect that this is something that will improve over time. Scala Native is still in the very early stages, and there is a lot of room for the project to grow.


Comparison to C++

As before, I also ran a test comparing these Scala results to C++ code compiled with the GNU compiler using the -Ofast flag. This uses a simpler test with the value class technique. You can see in the table below that the Scala code runs about 15 to 18% slower than C++ in all of the environments except Scala Native, which is more than six times slower. Given the results above indicating that Scala Native isn't nicely optimizing value classes yet, this result for Scala Native isn't surprising.

| Environment | Average Time [s] |
|---|---|
| g++ | 3.29 |
| Oracle JDK 8-191 | 3.88 |
| Oracle JDK 11 | 3.77 |
| OpenJDK 10 | 3.82 |
| GraalVM 1.0.0-rc7 | 3.82 |
| Scala Native | 21.6 |
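
The relative slowdowns quoted for this comparison follow directly from the averages in the table. A short sketch of that arithmetic, with the times copied from the table and the names `CppComparisonSketch` and `slowdown` chosen here purely for illustration:

```scala
object CppComparisonSketch {
  // Average times in seconds, copied from the table above.
  val gpp = 3.29
  val times = Seq(
    "Oracle JDK 8-191"  -> 3.88,
    "Oracle JDK 11"     -> 3.77,
    "OpenJDK 10"        -> 3.82,
    "GraalVM 1.0.0-rc7" -> 3.82,
    "Scala Native"      -> 21.6
  )

  /** Run time as a multiple of the g++ time (1.0 would mean equal speed). */
  def slowdown(t: Double): Double = t / gpp

  def main(args: Array[String]): Unit =
    for ((env, t) <- times) {
      val factor = slowdown(t)
      val percent = (factor - 1) * 100
      println(f"$env%-20s $factor%.2fx the g++ time ($percent%.0f%% slower)")
    }
}
```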


Conclusions

For me, there are two main takeaway messages from this. The first is that while Scala Native holds the long-term potential to give Scala a higher-performance platform for running computationally intensive jobs, it isn't there yet. A lot more work needs to go into optimization for it to reach its potential of competing with other natively compiled languages. I firmly believe that it can get there and is moving in that direction, but it isn't ready yet.

On the other hand, these results indicate to me that if you are programming Scala, you should strongly consider using Graal, even if you are doing numeric work. Based on presentations at Scala Days 2018 in New York, I know that this is the case for non-numeric codes, but at the time Graal wasn't emitting SIMD instructions, so it wasn't clear if it would compare well to the old C2 HotSpot optimizer. These results show that regardless of style, Graal is at least as performant as the other JVM options and that in some cases it is much faster. Perhaps most significantly, the functional 2 style, which is written in a much more idiomatic style for Scala, is more than 2x as fast with Graal as with the other JVMs. I should also note that Graal still allows me to run graphical applications like my SwiftVis2 plotting package, so there isn't any loss of overall functionality.

Going forward, I want to test the performance of more complex n-body simulations using trees and also look at multithreaded performance to see how Graal compares for those. Scala Native is still only single-threaded for pure Scala code, so it will likely be left out of those tests.

GraalVM native images are another feature that I would really like to explore, but there are some challenges in building them from Scala code that I did not take the time to overcome for this post.