Dynamics of Programming: Performance of Scala for Loops

One of the interesting features of the Scala programming language is that the for loop is basically syntactic sugar for calls to collection methods like foreach, map, filter, and flatMap. They are often called for-comprehensions and they have a lot more flexibility than what you get from a standard for loop in most languages or for-each loops in the languages that have them. The fact that they get compiled to calls to these other methods also means that Scala for-loops can be used on things that aren't collections like Options and Futures.

Unfortunately, this flexibility has often come at a price. As I noted in an earlier post, Loop Performance and Local Variables, using a for loop produced code that was significantly slower than using while loops. That was back in 2013, and I was using Scala 2.10. With the release of Scala 2.12, I really wanted to see if this was still the case. The primary change in 2.12 was to make Scala compile to Java 8 bytecode using all the new features of the Java 8 JVM. One of the main additions in Java 8 was lambda expressions. Since the foreach, map, filter, and flatMap are higher order methods that take functions, compiling to Java 8 lambdas seemed like it might improve performance. This post looks at testing that hypothesis.

Previous Code

We start by repeating the previous test that was written to look at where variables are declared. I took the same code as used before and simply ran it with three different versions of Scala, both with and without optimization. The following table shows the results.

Version	Opt	Loop	Var Loc	Average Time [ns]	Time Deviation [ns]
2.10.6		for	In	3.410	0.141
		for	Out	3.700	0.293
		while	In	2.823	0.270
		while	Out	2.861	0.311
	-optimize	for	In	3.712	0.588
		for	Out	3.856	0.653
		while	In	2.895	0.252
		while	Out	2.881	0.279
2.11.8		for	In	3.221	0.351
		for	Out	3.666	0.397
		while	In	3.023	0.510
		while	Out	2.848	0.243
	-optimize	for	In	3.389	0.402
		for	Out	3.014	0.120
		while	In	2.858	0.287
		while	Out	2.848	0.254
2.12.1		for	In	3.154	0.354
		for	Out	3.456	0.407
		while	In	2.765	0.167
		while	Out	2.751	0.139
	See Below	for	In	3.261	0.321
		for	Out	3.279	0.549
		while	In	3.141	0.207
		while	Out	3.164	0.264

All runs were done using Java 1.8.0_111 for the runtime. For 2.12, they added a lot of different optimization flags to the compiler. The values used for the timings in this post are -opt:l:classpath -opt:closure-invocations -opt:simplify-jumps -opt:copy-propagation -opt:redundant-casts -opt:box-unbox -opt:nullness-tracking -opt:inline-global. There is enough scatter here that it is hard to draw really strong conclusions. It appears that the while loop still has an advantage, but the percent difference in speed seems smaller across all the "current" compilers than what had been seen back in 2013. I put current in quotes because while 2.10 is older, 2.10.6 is a fairly recent release and the Scala team backports things when it makes sense, so there are good odds that 2.10.6 is incorporating optimizations of the for loop that weren't present in the earlier version of 2.10 I had been using in 2013.

N-Body Simulation

The problem of building multiplication tables was rather contrived as a simple example that worked well for testing the declaration locations of variables. If people are going to actually make their code uglier putting in while loops in place of for loops, it would be good to see if it matters on a somewhat more realistic example. For this I decided to do a simple first-order numerical integrator of bodies using gravity. This is a problem that involves a lot of number crunching in loops and which happens to be at least related to things that I write for my research, so it seemed like a good place to test performance.

The code used for this test is shown below. For the purposes of this post, what really matters is the forSim and whileSim methods. These have multiple loops including one area where they are triply nested. I store all the values in mutable arrays and then use a value class to access the elements in an object-oriented way. I chose this approach as there is minimal overhead from object allocation, potentially better cache performance, and I have a feeling that it is faster than other approaches, though testing that is a matter for later posts.

Here is a table giving the timing results for this code again the same three compilers.

Version	Opt	Loop	Average Time [s]	Time Deviation [s]
2.10.6		for	0.666	0.002
		while	0.667	0.029
	-optimize	for	0.660	0.012
	-optimize	while	0.658	0.001
2.11.8		for	0.716	0.009
		while	0.669	0.007
	-optimize	for	0.675	0.006
	-optimize	while	0.656	0.001
2.12.1		for	0.699	0.003
		while	0.683	0.001
	See Above	for	0.676	0.001
	See Above	while	0.683	0.003

Note that for this code, there is very little difference between a for loop and a while loop. These tests were very stable in their timing results and while building up the tests I ran them multiple times and found little variation. It really doesn't appear that 2.12 did anything to help with the difference between for and while loops in either of these examples, but in this one, there really isn't a significant difference in any version. What does that mean? As with so many things dealing with performance, you should write clean code that runs first. Once you have that, and you are tweaking things for performance, you might consider changing your inner-most loops from for loops to while loops, but it is quite possible that it won't matter.

I also feel compelled to note that the for loop version is much easier to parallelize than the while loop version because of the ease of switching to a parallel collection. I haven't done it here as one must make some alterations to prevent race conditions, but that is something that I might also explore in a future post.

Variables in for Comprehensions

There is one caveat to the conclusion that for loops don't hurt performance in the larger example. In the forSim method shown above, the variables pi and pj are both declared inside of the inner most loop. The for comprehension in Scala allows variables to be declared in the "header" section of the loop. When I first wrote this code, I declared pi between the two generators and pj right after the generator for j. One wouldn't think that this would matter much, but it did. Having the declarations up in the header instead of the body cause this code to run roughly 2.5-3x slower than when they were put as the first lines in the body of the for loop. I don't have an explanation for this behavior and I haven't explored the generated bytecode to see what might be causing it. However, based on this result, it is probably worth not using the variable declaration capabilities of for comprehensions if performance is critical to your application.

2 comments:

Steven WayneJuly 14, 2019 at 1:29 PM
Dr. Lewis, I am glad I found this blog site.
I have watched a lot of your videos about scala and scalafx
and have learned much. There have been many times I wanted to ask you a question but did not know how to reach you. I know this off topic of the above blog but if I may: Can the fast scala compiler be used with scalafx? Steve
Steven WayneJuly 14, 2019 at 4:42 PM
Dr. Lewis, I am glad I found this blog site.
I have watched a lot of your videos about scala and scalafx
and have learned much. There have been many times I wanted to ask you a question but did not know how to reach you. I know this off topic of the above blog but if I may: Can the fast scala compiler be used with scalafx? Steve

Sunday, January 1, 2017

Performance of Scala for Loops

Previous Code

N-Body Simulation

Variables in for Comprehensions

2 comments: