One of the interesting features of the Scala programming language is that the for loop is essentially syntactic sugar for calls to collection methods like foreach, map, filter, and flatMap. These constructs are usually called for comprehensions, and they are far more flexible than a standard for loop in most languages or the for-each loops in the languages that have them. Because they compile down to calls to those other methods, Scala for loops also work on things that aren't collections, like Options and Futures.
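As a rough illustration of this (a sketch of the idea; the exact desugaring the compiler performs has more cases than shown here), a for comprehension with a yield turns into a chain of flatMap/withFilter/map calls, and the same syntax works on Option:

```scala
// A for comprehension with yield...
val pairs = for {
  i <- 1 to 3
  j <- 1 to 3
  if i < j
} yield (i, j)

// ...is roughly sugar for this chain of higher-order method calls:
val desugared = (1 to 3).flatMap { i =>
  (1 to 3).withFilter(j => i < j).map(j => (i, j))
}

// The same syntax works on non-collections such as Option:
val sum = for {
  a <- Option(2)
  b <- Option(3)
} yield a + b  // Some(5)
```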
Unfortunately, this flexibility has often come at a price. As I noted in an earlier post, Loop Performance and Local Variables, using a for loop produced code that was significantly slower than using while loops. That was back in 2013, and I was using Scala 2.10. With the release of Scala 2.12, I really wanted to see if this was still the case. The primary change in 2.12 was to make Scala compile to Java 8 bytecode, taking advantage of the new features of the Java 8 JVM. One of the main additions in Java 8 was lambda expressions. Since foreach, map, filter, and flatMap are higher-order methods that take functions, compiling to Java 8 lambdas seemed like it might improve performance. This post looks at testing that hypothesis.
Previous Code
We start by repeating the previous test that was written to look at where variables are declared. I took the same code as used before and simply ran it with three different versions of Scala, both with and without optimization. The following table shows the results.
| Version | Opt | Loop | Var Loc | Average Time [ns] | Time Deviation [ns] |
|---|---|---|---|---|---|
| 2.10.6 | (none) | for | In | 3.410 | 0.141 |
| 2.10.6 | (none) | for | Out | 3.700 | 0.293 |
| 2.10.6 | (none) | while | In | 2.823 | 0.270 |
| 2.10.6 | (none) | while | Out | 2.861 | 0.311 |
| 2.10.6 | -optimize | for | In | 3.712 | 0.588 |
| 2.10.6 | -optimize | for | Out | 3.856 | 0.653 |
| 2.10.6 | -optimize | while | In | 2.895 | 0.252 |
| 2.10.6 | -optimize | while | Out | 2.881 | 0.279 |
| 2.11.8 | (none) | for | In | 3.221 | 0.351 |
| 2.11.8 | (none) | for | Out | 3.666 | 0.397 |
| 2.11.8 | (none) | while | In | 3.023 | 0.510 |
| 2.11.8 | (none) | while | Out | 2.848 | 0.243 |
| 2.11.8 | -optimize | for | In | 3.389 | 0.402 |
| 2.11.8 | -optimize | for | Out | 3.014 | 0.120 |
| 2.11.8 | -optimize | while | In | 2.858 | 0.287 |
| 2.11.8 | -optimize | while | Out | 2.848 | 0.254 |
| 2.12.1 | (none) | for | In | 3.154 | 0.354 |
| 2.12.1 | (none) | for | Out | 3.456 | 0.407 |
| 2.12.1 | (none) | while | In | 2.765 | 0.167 |
| 2.12.1 | (none) | while | Out | 2.751 | 0.139 |
| 2.12.1 | See below | for | In | 3.261 | 0.321 |
| 2.12.1 | See below | for | Out | 3.279 | 0.549 |
| 2.12.1 | See below | while | In | 3.141 | 0.207 |
| 2.12.1 | See below | while | Out | 3.164 | 0.264 |
All runs were done using Java 1.8.0_111 for the runtime. Scala 2.12 added a number of different optimization flags to the compiler. The values used for the timings in this post are `-opt:l:classpath -opt:closure-invocations -opt:simplify-jumps -opt:copy-propagation -opt:redundant-casts -opt:box-unbox -opt:nullness-tracking -opt:inline-global`. There is enough scatter here that it is hard to draw really strong conclusions. It appears that the while loop still has an advantage, but the percentage difference in speed seems smaller across all the "current" compilers than what had been seen back in 2013. I put current in quotes because, while 2.10 is older, 2.10.6 is a fairly recent release and the Scala team backports things when it makes sense, so there are good odds that 2.10.6 incorporates optimizations of the for loop that weren't present in the earlier 2.10 release I had been using in 2013.
N-Body Simulation
The problem of building multiplication tables was rather contrived: a simple example that worked well for testing the declaration locations of variables. If people are going to make their code uglier by putting while loops in place of for loops, it would be good to see whether it matters on a somewhat more realistic example. For this I decided to write a simple first-order numerical integrator of bodies interacting through gravity. This is a problem that involves a lot of number crunching in loops, and it happens to be at least related to things that I write for my research, so it seemed like a good place to test performance.
The code used for this test is shown below. For the purposes of this post, what really matters is the forSim and whileSim methods. These contain multiple loops, including one section that is triply nested. I store all the values in mutable arrays and then use a value class to access the elements in an object-oriented way. I chose this approach because there is minimal overhead from object allocation and potentially better cache performance, and I have a feeling that it is faster than other approaches, though testing that is a matter for later posts.
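Since the original listing is not reproduced here, the following is only a minimal sketch of the kind of code described: a first-order (Euler) gravity step over flat mutable arrays, once with nested for loops and once with hand-written while loops. The names (NBodySketch, forSim, whileSim, State) and all details are illustrative, not the post's actual code.

```scala
object NBodySketch {
  val n  = 30
  val dt = 1e-3

  case class State(x: Array[Double], y: Array[Double],
                   vx: Array[Double], vy: Array[Double], m: Array[Double])

  def init(): State = State(
    Array.tabulate(n)(i => math.cos(i)), Array.tabulate(n)(i => math.sin(i)),
    Array.fill(n)(0.0), Array.fill(n)(0.0), Array.fill(n)(1.0 / n))

  // Triply nested for loops: steps x bodies x bodies.
  def forSim(s: State, steps: Int): Unit = {
    import s._
    for (_ <- 0 until steps) {
      for (i <- 0 until n; j <- 0 until n; if i != j) {
        val dx = x(j) - x(i); val dy = y(j) - y(i)
        val d  = math.sqrt(dx * dx + dy * dy)
        val mag = dt * m(j) / (d * d * d)
        vx(i) += dx * mag; vy(i) += dy * mag
      }
      for (i <- 0 until n) { x(i) += dt * vx(i); y(i) += dt * vy(i) }
    }
  }

  // The same computation with hand-written while loops.
  def whileSim(s: State, steps: Int): Unit = {
    import s._
    var step = 0
    while (step < steps) {
      var i = 0
      while (i < n) {
        var j = 0
        while (j < n) {
          if (i != j) {
            val dx = x(j) - x(i); val dy = y(j) - y(i)
            val d  = math.sqrt(dx * dx + dy * dy)
            val mag = dt * m(j) / (d * d * d)
            vx(i) += dx * mag; vy(i) += dy * mag
          }
          j += 1
        }
        i += 1
      }
      var k = 0
      while (k < n) { x(k) += dt * vx(k); y(k) += dt * vy(k); k += 1 }
      step += 1
    }
  }
}
```

Both methods perform the same arithmetic in the same order, so they produce identical results; only the looping construct differs, which is exactly what the benchmark needs to isolate.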
Here is a table giving the timing results for this code against the same three compilers.
| Version | Opt | Loop | Average Time [s] | Time Deviation [s] |
|---|---|---|---|---|
| 2.10.6 | (none) | for | 0.666 | 0.002 |
| 2.10.6 | (none) | while | 0.667 | 0.029 |
| 2.10.6 | -optimize | for | 0.660 | 0.012 |
| 2.10.6 | -optimize | while | 0.658 | 0.001 |
| 2.11.8 | (none) | for | 0.716 | 0.009 |
| 2.11.8 | (none) | while | 0.669 | 0.007 |
| 2.11.8 | -optimize | for | 0.675 | 0.006 |
| 2.11.8 | -optimize | while | 0.656 | 0.001 |
| 2.12.1 | (none) | for | 0.699 | 0.003 |
| 2.12.1 | (none) | while | 0.683 | 0.001 |
| 2.12.1 | See above | for | 0.676 | 0.001 |
| 2.12.1 | See above | while | 0.683 | 0.003 |
Note that for this code there is very little difference between a for loop and a while loop. These tests were very stable in their timing results; while building them up I ran them multiple times and found little variation. It really doesn't appear that 2.12 did anything to close the gap between for and while loops in either of these examples, but in this one there isn't a significant difference in any version to begin with. What does that mean? As with so many things dealing with performance, you should first write clean code that runs. Once you have that, and you are tweaking things for performance, you might consider changing your innermost loops from for loops to while loops, but it is quite possible that it won't matter.
I also feel compelled to note that the for loop version is much easier to parallelize than the while loop version because of the ease of switching to a parallel collection. I haven't done that here, since one must make some alterations to prevent race conditions, but it is something that I might also explore in a future post.
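To sketch what that switch looks like (a toy example, not the simulation code, which would need the race-condition work mentioned above): calling .par on the generator is the only change needed, since each element here is processed independently.

```scala
// The same for comprehension, sequential and parallel. In Scala 2.12 the
// parallel collections are in the standard library (they moved to a
// separate module in 2.13).
val sequential = (for (i <- 1 to 1000) yield i.toLong * i).sum
val parallel   = (for (i <- (1 to 1000).par) yield i.toLong * i).sum
// Both compute the sum of squares from 1 to 1000; only the generator changed.
```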
Variables in for Comprehensions
There is one caveat to the conclusion that for loops don't hurt performance in the larger example. In the forSim method shown above, the variables pi and pj are both declared inside the innermost loop. The for comprehension in Scala allows values to be defined in the "header" section of the loop. When I first wrote this code, I declared pi between the two generators and pj right after the generator for j. One wouldn't expect this to matter much, but it did. Having the declarations up in the header instead of the body caused this code to run roughly 2.5-3x slower than when they were put as the first lines in the body of the for loop. I don't have an explanation for this behavior, and I haven't explored the generated bytecode to see what might be causing it. However, based on this result, it is probably worth avoiding the value-definition capability of for comprehensions if performance is critical to your application.
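To make the two forms concrete, here is a small sketch of the pattern (with a stand-in bodies array of Doubles rather than the simulation's particle views; the names are illustrative):

```scala
object HeaderVsBody {
  val n = 4
  val bodies = Array.tabulate(n)(i => i.toDouble)

  // Values bound in the comprehension's header (the slower pattern measured):
  def headerForm(): Double = {
    var acc = 0.0
    for {
      i <- 0 until n
      pi = bodies(i)      // defined between the generators
      j <- 0 until n
      pj = bodies(j)      // defined right after the generator for j
    } acc += pi * pj
    acc
  }

  // The same values declared in the loop body (the faster pattern):
  def bodyForm(): Double = {
    var acc = 0.0
    for (i <- 0 until n; j <- 0 until n) {
      val pi = bodies(i)
      val pj = bodies(j)
      acc += pi * pj
    }
    acc
  }
}
```

The two forms compute the same thing; the difference observed in the timings was purely in where the definitions sit.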
Dr. Lewis, I am glad I found this blog site. I have watched a lot of your videos about Scala and ScalaFX and have learned much. There have been many times I wanted to ask you a question but did not know how to reach you. I know this is off topic of the above blog, but if I may: can the fast Scala compiler be used with ScalaFX? Steve