Wednesday, January 4, 2017

Performance of N-Body Simulation Styles in Scala

I spend a lot of time doing N-body simulations. My working simulation code is written in C++. The reason for that is simple, speed. It is a serious number crunching code and even written in C++ runs can take months to complete. I would love to do these simulations in some other language that I enjoyed writing code in more, but I'm not willing to give up all that much speed in order to make that happen. A number of years ago a student and I explored the performance of a Java version of the code, and we found that we could get performance within 10-20% of the C++ version, but doing so required doing a number of things in a way that made the Java almost as challenging to work with as the C++. These days, I'd really love to be able to convert my code to Scala, as I find it much more enjoyable to code in, but I'd have to be certain that I can get performance close that what I'm getting in C++. My guess is that in the end I'll wind up updating my C++ code to use features of C++11, 14, and 17, but I still like to explore the possibilities of using a different language.

There are lots of ways that you can write N-body simulations in Scala. Some of them take advantage of the expressivity of the language to produce shorter code that is easier to read and generally less bug prone. However, there are reasons to believe that those approaches might also tend to be slower. For a while I've been meaning to explore different approaches to see how much of an impact it really has on the speed of execution. While there aren't all that many people who do N-body simulations, my expectation is that the results of these performance tests will apply to many other numerically intensive applications.

Array of Mutable Classes

In C++, one represents particles with a struct and makes sequences of these using arrays, vectors, or valarrays. Everything is generally mutable. This first version of the code in Scala is written to mirror the C++ code. Of course, there is a huge difference in the memory model. When you make a sequence of a simple struct in C++, the entire memory for all elements is laid out in a single contiguous block. The way this code is written in Scala, the array is an array of references to instances of MutableBody and each of those contains two references to instances of MVect3. This matters because caching is vitally important on modern computers and contiguous memory accesses have much better caching performance. In this simple example, odds are that because all of the objects are allocated in order at the beginning of the program, they wind up being contiguous in the heap, but there are still multiple levels of references that have to be followed, which are likely to impose some performance cost.

This code really does look similar to what the C++ code for this type of thing looks like, unless you happen to have a library set up for SIMD vector operations or expression templates for the 3-D vectors. Unfortunately, that isn't a good thing. Not only is this code verbose, I can speak from experience that it is bug prone. It is just too easy to have one of those x, y, or z fields typed wrong and the resulting error is very difficult to diagnose and track down.

Array of Immutable Classes

Given the capabilities of Scala, a much "better" way to write this is to use immutable classes and add symbolic methods to a 3-D vector class so that you can simplify the math expressions and make them less error prone. Of course, this change means that you are doing a lot of memory allocation making small objects for all of the 3-D vectors.

Clearly this code is shorter and more readable. It is still using the same basic approach with a mutable array for storage, but the ability to use mathematical operators on vectors improves the code significantly. It is worth noting that in the mutable version you can make operations like +=, but given the math that is being done, they aren't all that useful unless you are making temporary values to store scaled versions and such.

Arrays with Value Class Wrapper

Now we are going to get a bit more interesting and produce a version that tries to give us contiguous memory without hurting program readability and without making lots of temporaries. Basically, this is an attempt to make a version of the code that has the best chance of being speed competitive with the C++ code while still being reasonably readable. The real key areas of this code are lines 5-29 where we make a number of arrays of doubles and then define a value class. The value class is a newer feature of the Scala language that is stack allocated and has no more overhead than a primitive, while allowing you to have nice OO syntax. Instances of it are stored as just primitives on the stack and the methods wind up being static methods in the JVM, so there is no memory or speed overhead of object allocation. In order to use a value class, we have to put the whole thing in an object declaration and not a class declaration. Value classes can't go inside of other classes because then they would be non-static inner classes and that would mean they have the overhead of keeping a reference to their enclosing instance. They could go at the top level, outside of all other declarations, but then they wouldn't have access to the arrays. Putting both the arrays and the value class in an object declaration gives us what we need here.

Unfortunately, value classes are limited in Scala because they aren't really supported by the JVM at this time. The main limitation is that they can only have a single primitive value in them. There are plans to add value types to Java and the JVM. Once that happens, which probably will be Java 10, then you could make classes with three doubles in them that are stack allocated and arrays of them would be memory contiguous, giving you behavior much more similar to C++. Until that time, code like that above can give you a reasonable approximation.

Purely Functional Approaches

Of course, the real key to Scala is the functional aspects, something that none of the above examples took advantage of. The version that used immutable classes still mutated an array that stored references to those objects. I tried two different approaches to doing things functionally that are shown in the code below. The first approach is pretty much just like our immutable class version, but it uses a Vector and the updated method. Even without speed tests, we can feel confident that this is not an efficient approach, but it is pretty much required if we want to do the minimum number of distance calculations where each pair is only visited once. The second version, called forSim2, does twice as many distance calculations, but this allows it to be written in a much more succinct manner because we produce a full sequence with the accelerations to each particle without every mutating them. As it happens, this is the approach that we need to take if we parallelize the code to prevent race conditions. I will explore that more in a future blog post.

This code uses the ImmutableBody and the immutable Vect3 class shown earlier. That makes the code more compact than the non-mutable versions. Those unfamiliar with fold methods might find forSim2 difficult to read, but in general a fold can be used to replace a loop that would accumulate a value in a mutable way. Note that like the previous code, this is put in an object, but for completely different reasons. This code isn't stateful, so there is no reason to have a class. The state is created by the initBodies method and then passed into the forSim methods, which return modified values. This code never stores the state locally. This is a significant difference from the earlier code and should probably be expected for a functional approach.

Timing Results

So how do these different versions perform? We tested all of them using Scala 2.12.1 both with no optimization flags and with -opt:l:classpath -opt:closure-invocations -opt:simplify-jumps -opt:copy-propagation -opt:redundant-casts -opt:box-unbox -opt:nullness-tracking -opt:inline-global. The timing results are shown in the following table.

OptimizationStyleAverage Time [s]rms [s]
NoneValue Class0.7040.004
Mutable Class0.9040.006
Immutable Class1.2660.020
Functional 18.330.09
Functional 22.3430.048
OptimizedValue Class0.6790.001
Mutable Class0.6820.001
Immutable Class1.1910.026
Functional 18.550.11
Functional 22.3250.023

It is good to see that in many ways, my intuitions as to which versions would be fastest proved correct. Without optimization, the version that uses a value class and contiguous blocks of memory in arrays is clearly the fastest, followed by the mutable classes, then the immutable classes, then the purely functional approaches at the end. Note that the second functional approach takes nearly twice as long as the immutable class version. Since it is doing twice as many distance calculations, this indicates that the overhead of being functional is small and the speed difference is basically due to reorganizing the math. I will note again that this reorganization is also required for parallelization, so I would speculate that in parallelized versions, the second functional approach will be comparable to the immutable class version. The first functional approach is a clear loser. Calling updated so frequently on the Vector type is clearly inefficient. I will note though that I tried changing Vector to Array, and the resulting code was so slow for the first functional approach that I never saw it complete. On the other hand, using an array, even if it isn't mutated, seemed to slightly improve performance for the second functional approach.

It is also interesting to note that optimization benefited the mutable class version significantly, making it nearly comparable to the value class version.

Comparison to C++

We will close out this post with a comparison to C++. After all, we can't know if it is reasonable to use Scala for numerical workloads if we don't explore how it compares. I wrote a version of the code in C++ that was basically an adaptation of the mutable class version using a std::valarray for storage. I compiled it using g++ with the -Ofast compiler flag. I also made a separate application in Scala that only ran the value class version and had both the C++ and the Scala time 1000 steps of integration. The result was that the C++ generally completed in 6.22 seconds and the Scala completed in 6.88. Both were consistent across multiple runs. So the final result is a 10% advantage for C++. That isn't a huge advantage, and my guess is that most applications would be happy to live with that for the advantages that they get in coding in Scala over C++ in terms of developer productivity. Granted, that was with g++ I expect that using the Intel compiler or some other highly optimizing, and not free or open source, compiler will produce even faster executables for C++.

For me, the bottom line is that if you hear people say that Scala (or Java/JVM) are slow, there are good odds that they haven't really tested it compared to other languages/platforms. For straight number crunching applications, feel free to point them to this blog post to show them that the difference in speed really doesn't have to be all that large. My guess is that using macros, it would also be possible to create Scala code that has the readability of the immutable Vect3 class with symbolic methods, but without the speed cost for allocating lots of temporary values. This would be akin to expression templates for that purpose in C++. Maybe I'll take a stab at creating that code and write a blog about it as well. I also look forward to the Java 10 JVM adding value types, as I believe that they have the potential to significantly improve the performance of numerical code in all JVM languages.

Other Code

Here is the code for C++ and the timing in Scala in case anyone wants to be able to run everything.

C++ Code

Scala Timing