Sunday, January 1, 2017

Performance of Scala for Loops

One of the interesting features of the Scala programming language is that the for loop is basically syntactic sugar for calls to collection methods like foreach, map, filter, and flatMap. These loops are often called for-comprehensions, and they have a lot more flexibility than a standard for loop in most languages, or the for-each loops in the languages that have them. The fact that they get compiled to calls to these other methods also means that Scala for loops can be used on things that aren't collections, like Options and Futures.
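As a small illustration of that rewriting, here is a sketch that pairs a few for-comprehensions with roughly the method calls the compiler produces for them. The object and method names are made up for this example. Note that in current Scala versions a guard is translated to withFilter, a lazy relative of filter.

```scala
object ForDesugarSketch {
  // A generator, a guard, and a yield ...
  def evensSquaredFor(n: Int): Seq[Int] =
    for (i <- 0 until n if i % 2 == 0) yield i * i

  // ... become withFilter followed by map.
  def evensSquaredCalls(n: Int): Seq[Int] =
    (0 until n).withFilter(i => i % 2 == 0).map(i => i * i)

  // Two generators become flatMap on the outer one and map on the inner one.
  def pairsFor(n: Int): Seq[(Int, Int)] =
    for (i <- 0 until n; j <- 0 until i) yield (i, j)

  def pairsCalls(n: Int): Seq[(Int, Int)] =
    (0 until n).flatMap(i => (0 until i).map(j => (i, j)))

  // The same syntax works on non-collections such as Option,
  // because Option also has map and flatMap.
  def addOptions(a: Option[Int], b: Option[Int]): Option[Int] =
    for (x <- a; y <- b) yield x + y

  def main(args: Array[String]): Unit = {
    println(evensSquaredFor(6) == evensSquaredCalls(6))
    println(pairsFor(3) == pairsCalls(3))
    println(addOptions(Some(1), Some(2)))
    println(addOptions(Some(1), None))
  }
}
```

A for loop without yield follows the same pattern, except that the final call is foreach instead of map.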

Unfortunately, this flexibility has often come at a price. As I noted in an earlier post, Loop Performance and Local Variables, using a for loop produced code that was significantly slower than using while loops. That was back in 2013, and I was using Scala 2.10. With the release of Scala 2.12, I really wanted to see if this was still the case. The primary change in 2.12 was to make Scala compile to Java 8 bytecode using all the new features of the Java 8 JVM. One of the main additions in Java 8 was lambda expressions. Since foreach, map, filter, and flatMap are higher-order methods that take functions, compiling to Java 8 lambdas seemed like it might improve performance. This post looks at testing that hypothesis.
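To make the comparison concrete, here is a minimal sketch of the three forms involved: a for loop, the foreach call it is rewritten to, and a hand-written while loop. The names are hypothetical and this is not the benchmark code. The point is only that the for loop body ends up inside a function passed to foreach, while the while loop does not involve a function object at all.

```scala
object LoopFormsSketch {
  // A plain for loop over a Range.
  def sumFor(n: Int): Long = {
    var sum = 0L
    for (i <- 0 until n) sum += i
    sum
  }

  // Roughly what the compiler produces for sumFor: the loop body
  // becomes a function (a lambda on Java 8) passed to foreach.
  def sumForeach(n: Int): Long = {
    var sum = 0L
    (0 until n).foreach(i => sum += i)
    sum
  }

  // The while loop version manages its own counter and never
  // creates a Range or a function object.
  def sumWhile(n: Int): Long = {
    var sum = 0L
    var i = 0
    while (i < n) {
      sum += i
      i += 1
    }
    sum
  }
}
```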

Previous Code

We start by repeating the previous test that was written to look at where variables are declared. I took the same code as used before and simply ran it with three different versions of Scala, both with and without optimization. The following table shows the results.

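| Version | Opt       | Loop  | Var Loc | Average Time [ns] | Time Deviation [ns] |
|---------|-----------|-------|---------|-------------------|---------------------|
| 2.10.6  | (none)    | for   | In      | 3.410             | 0.141               |
| 2.10.6  | (none)    | for   | Out     | 3.700             | 0.293               |
| 2.10.6  | (none)    | while | In      | 2.823             | 0.270               |
| 2.10.6  | (none)    | while | Out     | 2.861             | 0.311               |
| 2.10.6  | -optimize | for   | In      | 3.712             | 0.588               |
| 2.10.6  | -optimize | for   | Out     | 3.856             | 0.653               |
| 2.10.6  | -optimize | while | In      | 2.895             | 0.252               |
| 2.10.6  | -optimize | while | Out     | 2.881             | 0.279               |
| 2.11.8  | (none)    | for   | In      | 3.221             | 0.351               |
| 2.11.8  | (none)    | for   | Out     | 3.666             | 0.397               |
| 2.11.8  | (none)    | while | In      | 3.023             | 0.510               |
| 2.11.8  | (none)    | while | Out     | 2.848             | 0.243               |
| 2.11.8  | -optimize | for   | In      | 3.389             | 0.402               |
| 2.11.8  | -optimize | for   | Out     | 3.014             | 0.120               |
| 2.11.8  | -optimize | while | In      | 2.858             | 0.287               |
| 2.11.8  | -optimize | while | Out     | 2.848             | 0.254               |
| 2.12.1  | (none)    | for   | In      | 3.154             | 0.354               |
| 2.12.1  | (none)    | for   | Out     | 3.456             | 0.407               |
| 2.12.1  | (none)    | while | In      | 2.765             | 0.167               |
| 2.12.1  | (none)    | while | Out     | 2.751             | 0.139               |
| 2.12.1  | See below | for   | In      | 3.261             | 0.321               |
| 2.12.1  | See below | for   | Out     | 3.279             | 0.549               |
| 2.12.1  | See below | while | In      | 3.141             | 0.207               |
| 2.12.1  | See below | while | Out     | 3.164             | 0.264               |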

All runs were done using Java 1.8.0_111 for the runtime. For 2.12, they added a lot of different optimization flags to the compiler. The values used for the timings in this post are -opt:l:classpath -opt:closure-invocations -opt:simplify-jumps -opt:copy-propagation -opt:redundant-casts -opt:box-unbox -opt:nullness-tracking -opt:inline-global. There is enough scatter here that it is hard to draw really strong conclusions. It appears that the while loop still has an advantage, but the percent difference in speed seems smaller across all the "current" compilers than what had been seen back in 2013. I put current in quotes because while 2.10 is older, 2.10.6 is a fairly recent release and the Scala team backports things when it makes sense, so there are good odds that 2.10.6 is incorporating optimizations of the for loop that weren't present in the earlier version of 2.10 I had been using in 2013.
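For readers who have not seen the earlier post, the four Loop and Var Loc combinations in the table can be sketched roughly as follows. This is an illustration written for this post, not the code that was timed. The multiplication-table task comes from the earlier test, but the method names, the table layout, and the variable prod are assumptions. "In" means the temporary is declared inside the inner loop, and "Out" means it is declared once outside both loops and reassigned.

```scala
object VarLocationSketch {
  // for loop, temporary declared inside the inner loop ("In").
  def forIn(n: Int): Array[Array[Int]] = {
    val table = Array.ofDim[Int](n, n)
    for (i <- 0 until n; j <- 0 until n) {
      val prod = (i + 1) * (j + 1)
      table(i)(j) = prod
    }
    table
  }

  // for loop, temporary declared outside the loops ("Out").
  def forOut(n: Int): Array[Array[Int]] = {
    val table = Array.ofDim[Int](n, n)
    var prod = 0
    for (i <- 0 until n; j <- 0 until n) {
      prod = (i + 1) * (j + 1)
      table(i)(j) = prod
    }
    table
  }

  // while loop, temporary declared inside the inner loop ("In").
  def whileIn(n: Int): Array[Array[Int]] = {
    val table = Array.ofDim[Int](n, n)
    var i = 0
    while (i < n) {
      var j = 0
      while (j < n) {
        val prod = (i + 1) * (j + 1)
        table(i)(j) = prod
        j += 1
      }
      i += 1
    }
    table
  }

  // while loop, all variables declared outside the loops ("Out").
  def whileOut(n: Int): Array[Array[Int]] = {
    val table = Array.ofDim[Int](n, n)
    var i = 0
    var j = 0
    var prod = 0
    while (i < n) {
      j = 0
      while (j < n) {
        prod = (i + 1) * (j + 1)
        table(i)(j) = prod
        j += 1
      }
      i += 1
    }
    table
  }
}
```

In the for loop "Out" variant the var is captured by the function that the loop body compiles to, which is one plausible reason that variant can behave differently from the others.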

N-Body Simulation

The problem of building multiplication tables was rather contrived as a simple example that worked well for testing the declaration locations of variables. If people are going to actually make their code uglier by putting in while loops in place of for loops, it would be good to see if it matters on a somewhat more realistic example. For this I decided to do a simple first-order numerical integrator for bodies interacting through gravity. This is a problem that involves a lot of number crunching in loops and which happens to be at least related to things that I write for my research, so it seemed like a good place to test performance.

The code used for this test is shown below. For the purposes of this post, what really matters is the forSim and whileSim methods. These have multiple loops including one area where they are triply nested. I store all the values in mutable arrays and then use a value class to access the elements in an object-oriented way. I chose this approach as there is minimal overhead from object allocation, potentially better cache performance, and I have a feeling that it is faster than other approaches, though testing that is a matter for later posts.
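As a rough sketch of the layout described above, and not the code that produced the timings, the following stores everything in flat mutable arrays and reads them through a value class. The method names forSim and whileSim come from the text. Everything else is an assumption made for illustration: the 2D setup, the array names, the three initial bodies, the Particle class and its kick and drift methods, and units where G = 1.

```scala
object NBodySketch {
  val n = 3
  val masses = Array(1.0, 1e-3, 1e-3)
  val xs = new Array[Double](n)
  val ys = new Array[Double](n)
  val vxs = new Array[Double](n)
  val vys = new Array[Double](n)

  // Hypothetical initial conditions: a heavy body and two light ones.
  def reset(): Unit = {
    xs(0) = 0.0; ys(0) = 0.0; vxs(0) = 0.0; vys(0) = 0.0
    xs(1) = 1.0; ys(1) = 0.0; vxs(1) = 0.0; vys(1) = 1.0
    xs(2) = 0.0; ys(2) = 2.0; vxs(2) = -0.7; vys(2) = 0.0
  }

  // Value class: wraps only an index, so no object is normally allocated.
  class Particle(val i: Int) extends AnyVal {
    def x: Double = xs(i)
    def y: Double = ys(i)
    def mass: Double = masses(i)
    def kick(dvx: Double, dvy: Double): Unit = { vxs(i) += dvx; vys(i) += dvy }
    def drift(dt: Double): Unit = { xs(i) += vxs(i) * dt; ys(i) += vys(i) * dt }
  }

  // Apply the mutual gravitational kick between two particles over dt.
  private def interact(pi: Particle, pj: Particle, dt: Double): Unit = {
    val dx = pj.x - pi.x
    val dy = pj.y - pi.y
    val dist = math.sqrt(dx * dx + dy * dy)
    val mag = dt / (dist * dist * dist)
    pi.kick(dx * mag * pj.mass, dy * mag * pj.mass)
    pj.kick(-dx * mag * pi.mass, -dy * mag * pi.mass)
  }

  // First-order integration with for loops; pi and pj are declared in the body.
  def forSim(steps: Int, dt: Double): Unit = {
    for (_ <- 0 until steps) {
      for (i <- 0 until n; j <- i + 1 until n) {
        val pi = new Particle(i)
        val pj = new Particle(j)
        interact(pi, pj, dt)
      }
      for (i <- 0 until n) new Particle(i).drift(dt)
    }
  }

  // The same computation, in the same order, written with while loops.
  def whileSim(steps: Int, dt: Double): Unit = {
    var s = 0
    while (s < steps) {
      var i = 0
      while (i < n) {
        var j = i + 1
        while (j < n) {
          interact(new Particle(i), new Particle(j), dt)
          j += 1
        }
        i += 1
      }
      i = 0
      while (i < n) { new Particle(i).drift(dt); i += 1 }
      s += 1
    }
  }
}
```

Because both methods perform the same floating-point operations in the same order, calling reset() and then either one with the same arguments should leave the arrays in the same state, which makes it easy to check that the two versions being timed really do the same work.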

Here is a table giving the timing results for this code, again using the same three compilers.

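| Version | Opt       | Loop  | Average Time [s] | Time Deviation [s] |
|---------|-----------|-------|------------------|--------------------|
| 2.10.6  | (none)    | for   | 0.666            | 0.002              |
| 2.10.6  | (none)    | while | 0.667            | 0.029              |
| 2.10.6  | -optimize | for   | 0.660            | 0.012              |
| 2.10.6  | -optimize | while | 0.658            | 0.001              |
| 2.11.8  | (none)    | for   | 0.716            | 0.009              |
| 2.11.8  | (none)    | while | 0.669            | 0.007              |
| 2.11.8  | -optimize | for   | 0.675            | 0.006              |
| 2.11.8  | -optimize | while | 0.656            | 0.001              |
| 2.12.1  | (none)    | for   | 0.699            | 0.003              |
| 2.12.1  | (none)    | while | 0.683            | 0.001              |
| 2.12.1  | See above | for   | 0.676            | 0.001              |
| 2.12.1  | See above | while | 0.683            | 0.003              |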

Note that for this code, there is very little difference between a for loop and a while loop. These tests were very stable in their timing results and while building up the tests I ran them multiple times and found little variation. It really doesn't appear that 2.12 did anything to help with the difference between for and while loops in either of these examples, but in this one, there really isn't a significant difference in any version. What does that mean? As with so many things dealing with performance, you should write clean code that runs first. Once you have that, and you are tweaking things for performance, you might consider changing your inner-most loops from for loops to while loops, but it is quite possible that it won't matter.

I also feel compelled to note that the for loop version is much easier to parallelize than the while loop version because of the ease of switching to a parallel collection. I haven't done it here as one must make some alterations to prevent race conditions, but that is something that I might also explore in a future post.

Variables in for Comprehensions

There is one caveat to the conclusion that for loops don't hurt performance in the larger example. In the forSim method shown above, the variables pi and pj are both declared inside of the innermost loop. The for comprehension in Scala allows variables to be declared in the "header" section of the loop. When I first wrote this code, I declared pi between the two generators and pj right after the generator for j. One wouldn't think that this would matter much, but it did. Having the declarations up in the header instead of the body caused this code to run roughly 2.5-3x slower than when they were put as the first lines in the body of the for loop. I don't have an explanation for this behavior and I haven't explored the generated bytecode to see what might be causing it. However, based on this result, it is probably worth not using the variable declaration capabilities of for comprehensions if performance is critical to your application.
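The two placements can be sketched as follows, using a made-up arithmetic loop rather than the simulation code. The names are hypothetical. The third method shows roughly how the language specification says a header definition is translated: the preceding generator is mapped to a collection of tuples that carry the new value along. That extra map and the tuple allocations are one plausible source of the slowdown, though this has not been confirmed against the bytecode or the timings in this post.

```scala
object HeaderVarsSketch {
  // Values defined in the header, between and after the generators.
  def sumHeader(n: Int): Long = {
    var sum = 0L
    for {
      i <- 0 until n
      a = i * 2
      j <- 0 until n
      b = j + 1
    } sum += a * b
    sum
  }

  // The same values defined as the first lines of the loop body.
  def sumBody(n: Int): Long = {
    var sum = 0L
    for (i <- 0 until n; j <- 0 until n) {
      val a = i * 2
      val b = j + 1
      sum += a * b
    }
    sum
  }

  // Roughly what sumHeader expands to: each header definition adds a
  // map that builds a collection of tuples before the foreach runs.
  def sumHeaderExpanded(n: Int): Long = {
    var sum = 0L
    (0 until n).map { i => val a = i * 2; (i, a) }.foreach { case (_, a) =>
      (0 until n).map { j => val b = j + 1; (j, b) }.foreach { case (_, b) =>
        sum += a * b
      }
    }
    sum
  }
}
```

All three methods compute the same result, so the difference is purely in how much intermediate work the generated code does.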