Wednesday, January 2, 2013

Loop Performance and Local Variables in Scala

This post was motivated by Local variables inside a loop and performance. I have to admit that the results of that post didn't really surprise me. For the code that they had written I felt like any reasonable compiler should have made the stack frame for the whole function big enough to hold those variables so that there was no additional work done for local declaration inside of a loop. What interested me though was whether the same would be true in Scala.

Why Scala Might Be Different
The for loop in Java is just a dressed up while loop. It is useful because it gives you places to put all the main components of a loop so that you are less likely to forget them. However, when you convert it to the assembly level (or bytecode level) it should have a conditional branch at the top and a non-conditional branch at the bottom with little additional overhead. In Scala however, the for loop is better described by the name "for comprehension". In a sense, there is no for loop in Scala. Instead, uses of for get translated to calls to higher order methods like map, flatMap, filter, and foreach. As an example, consider this for loop in Scala.

for (i <- 0 until runs) {
  val x = i % 12
  val y = i / 12 % 12
  val times = x * y
  counters(times) += 1
}

This gets translated to the following:

(0 until runs).foreach(i => {
  val x = i % 12
  val y = i / 12 % 12
  val times = x * y
  counters(times) += 1
}

This very simple example doesn't really show the power that is gained by this translation, but the capabilities of for comprehensions really do add a lot to Scala. Some of these features include simple data parallelism with parallel collections and concise syntax for Futures in the Akka framework. However, this greater power does come with some overhead in terms of performance. This change also makes it less clear to me if local variables in for loops would be significant.

Test Code
The following code shows how I modified the Java from Peter's blog post. The initial calls to the test functions are there to allow the JIT to go through the code before any timing is done. I also went a bit further and had each function call happen multiple times so that I could get a standard deviation in addition to average values. There are two functions using the for loop and two using the while loop.

object LocalVars {
  def main(args:Array[String]) {
    testInsideForLoop
    testOutsideForLoop
    testInsideWhileLoop
    testOutsideWhileLoop
    printResults("In for loop:",Array.fill(10000)(testInsideForLoop))
    printResults("Out of for loop:",Array.fill(10000)(testOutsideForLoop))
    printResults("In while loop:",Array.fill(10000)(testInsideWhileLoop))
    printResults("Out of while loop:",Array.fill(10000)(testOutsideWhileLoop))
  }

  def printResults(header:String,results:Array[Double]) {
    println(header)
    val average = results.sum/results.length
    printf("Average = %.3f ns\n", average)
    val aveSqr = results.map(x => x*x).sum/results.length
    printf("Standard dev = %.3f ns\n\n", math.sqrt(aveSqr-average*average))
  }

  def testInsideForLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    for (i <- 0 until runs) {
      val x = i % 12
      val y = i / 12 % 12
      val times = x * y
      counters(times) += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testOutsideForLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var x, y, times = 0
    for (i <- 0 until runs) {
      x = i % 12
      y = i / 12 % 12
      times = x * y
      counters(times) += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testInsideWhileLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var i = 0
    while (i < runs) {
      val x = i % 12
      val y = i / 12 % 12
      val times = x * y
      counters(times) += 1
      i += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testOutsideWhileLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var x, y, times, i = 0
    while (i < runs) {
      x = i % 12
      y = i / 12 % 12
      times = x * y
      counters(times) += 1
      i += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }
}

Results
I compiled the above code using using Scala 2.10.0-RC5 with the -optimise option. When I ran it I got the following output.

In for loop:
Average = 6.073 ns
Standard dev = 0.448 ns

Out of for loop:
Average = 6.084 ns
Standard dev = 0.507 ns

In while loop:
Average = 4.670 ns
Standard dev = 0.322 ns

Out of while loop:
Average = 4.669 ns
Standard dev = 0.320 ns


It is clear that where you declare local variables doesn't matter at all as all variations are inside the error bars for the measurement. However, there is a reasonable performance penalty for using the for loop. The ScalaCL project provides one approach to getting around this. Since ScalaCL is not up to date with 2.10 I didn't test this out. Earlier testing under 2.8 showed me that it generally did bring the performance of for loops in line with while loops. I would love to see the ScalaCL project get more support both for this reason and to improve the OpenCL collections.

More Questions
I have two problems with what is done here. First, my Scala code is not at all Scala-like. Part of this is because Peter picked a problem that is best solved in a very imperative way. It would be interesting to pick a problem that also has a nice functional solution and then play with that a bit. This would also make it possible to try seeing how using parallel collections could be beneficial.

A more significant question was posed in a discussion on Google+ by one of my former students, Will Shepherd. This is the question of what happens with memory if the variables are a bit more complex. Here again, the chosen example code holds us back a bit as there doesn't seem to be much need for non-primitive variables here. However, the results could be extremely different if the variable were to hold an array or some type of mutable object.

Perhaps someone can suggest a problem that would make some sense in exploring these two avenues.

Conclusion
My take away message from Peter's original blog, which I think hold true here as well is that your best bet is to write good code first, then worry about performance as needed. Declaring variables locally is just a good idea. Limiting scope is remarkably helpful in preventing bugs. So that is how you should write your code. If you find that your code isn't fast enough once you have it working, then you explore options for making it faster. In the case of local declarations of primitive variables in loops, I think it is clear that there isn't any point in even trying that route. However, if you do decide to start exploring optimizations, you need to be careful and meticulous. Re-benchmark after each "optimization" because it is possible that you might make a change which leads to uglier, harder to maintain code without getting any speed benefit at all.