Friday, November 29, 2013

Benefits of Scala in CS1

In my last post I described why Trinity decided to give Scala a go as our language of instruction for CS1 and CS2. In this post I'll run through many of the key aspects of the language that I see as being significant for teaching CS1. I should note that when we started using Scala, one of the challenges we faced was a lack of teaching material. To address this, I wrote my own. The text of that material came out as a textbook published by CRC Press in the fall of 2012. Since the book came out I have focused more on making screencast videos of live coding to support a flipped classroom style of teaching. In many ways, what I describe here follows my approach to CS1, which is not purely functional and includes many imperative elements. I truly hope that Scala will catch on more strongly in the CS education market and someone will write a textbook for CS1 and CS2 using Scala that has a more purely functional approach. Trinity has a separate course on functional programming so we really don't want to focus on that in CS1.

The REPL and Scripting

As I mentioned in the last post, my real epiphany came when I realized that some of the key features that make Python so great for teaching introductory CS apply just as well to Scala. In particular, the REPL and the scripting environment really stand out as being beneficial. I also feel that having the type inference in the REPL is particularly helpful early on. Just consider the following part of a REPL session.
scala> 5
res0: Int = 5

scala> 4.3
res1: Double = 4.3

scala> 5+7
res2: Int = 12

scala> 2+4.3
res3: Double = 6.3
Entering a few simple expressions with literals and mathematical operators allows you to demonstrate each of the different types and what they do. Just like in Python, the REPL gives students the ability to explore on their own and test things out in an environment with immediate feedback. The biggest difference is that when the values are displayed, students get to see their types as well.

The scripting environment also makes the first programs that you write extremely simple. Here is the canonical "Hello World" program written as a Scala script.
println("Hello world!")
Compare that to what you would write in Java. Even the command for printing is more like what you expect from Python than what you get with Java. The big difference is, of course, that Scala is doing static type checks and reports type errors as syntax errors, though admittedly the line is a bit blurred with the REPL and scripts as there isn't a separate compile phase that is obvious to the user. If you enter a type mismatch in the REPL you will get output that looks like this.
scala> math.sqrt("64")
<console>:8: error: type mismatch;
 found   : String("64")
 required: Double
              math.sqrt("64")
                        ^
When writing scripts, it is helpful to have easy text input and output. The code above showed that printing is done with a call to println (or print if you don't want the line feed). Simple text input is just as easy with functions like readLine(), readInt(), and readDouble(). All of these read a full line then try to return the specified type. The downside is that you can't use readInt() when reading multiple numbers on one line. This is fine for the first programs and it isn't too long before students can read a line full of numbers with something like the following.
val nums = readLine().split(" ").map(_.toInt)
This code returns an Array[Int] with as many values as are entered on the given line of input. The split method is from the Java API and breaks a String into an array around some delimiter. Unfortunately, many educators who aren't familiar with functional programming have a negative reaction to map. They shouldn't, but I think there is a natural reaction that if they don't know something then it is obviously too complex for the novice. In reality, map is much simpler than the equivalent code in most other languages and students pick it up nicely when exposed to it early on.
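To give a sense of how little there is to map, here is a small sketch of that same pattern. (The input line is hard-coded in place of readLine() so the example stands alone; the values are arbitrary.)

```scala
// A hard-coded line standing in for readLine().
val line = "3 14 15"
val nums = line.split(" ").map(_.toInt)
println(nums.mkString(", "))     // prints 3, 14, 15
// map just applies a function to every element, producing a new collection.
val doubled = nums.map(_ * 2)
println(doubled.mkString(", "))  // prints 6, 28, 30
```

Once students see map as "do this to every element," the line in question reads quite naturally.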

Programming Environment/Tooling

One of the standard complaints about Scala is that the tooling isn't as advanced as other languages. Inevitably there is some truth to this as any new language starts with a disadvantage in that area. However, people leveling this complaint really need to have worked with Scala very recently for it to carry weight as the support environment around Scala is advancing by leaps and bounds.

In a Twitter conversation, Martin Odersky mentioned that he prefers IDEs for students because of the immediate feedback. There is no doubt that this is a great strength of IDEs. I also feel that in languages like Scala with type inference the ability of the IDE to display types is extremely beneficial. I use Eclipse for my own code development and I have felt that Eclipse works really well in the MOOCs that Martin has run on Coursera.

Despite these things, we use the command line and a basic text editor in our CS1, and that works well with Scala. Inevitably, part of our choice of these tools is historical. There are also many pedagogical reasons we choose to have students use these basic tools for CS1. They include things like leveling the playing field between experienced and novice programmers and our desire to have Linux and the command line early in the curriculum. Comparing command line tools to Eclipse, though, I think the biggest reason to go command line in CS1 is the relative complexity of Eclipse. I agree that it is a great tool for the MOOCs, but the MOOCs have people like me signed up for them. My guess is that very few participants have never written a line of code before in their lives. Thanks to the current state of CS education in the US, 50% or more of my CS1 students have never seen or written code before entering my class. (I am hopeful that Code.org and the general push to include more coding in schools will help to change that.) The other half have taken a course using Java under Windows with an IDE in their high school. Starting off in Eclipse would give those experienced students even more of an advantage coming in. I'd rather give them something new to learn to keep them engaged.

One last point I would offer in support of command line is that I think it helps to instill some good practices. If you are editing your code in a text editor you really do need to stop every so often and make sure that what you have written does what you want. The lack of immediate feedback and hover-over types forces students to focus more on what they are doing. They have to think through the types of their expressions instead of relying on their tool to do that work for them. As a result, students truly appreciate Eclipse when I introduce them to it at the end of CS1 as we start the transition to object-orientation and CS2.

Regular Syntax

Another complaint that one often hears about Scala is that it is too complex. I think that this complaint has been well dealt with in other spaces, especially with the concept of there being different levels of Scala programming. I want to briefly dispel this for topics at the CS1 level. Indeed, I would argue that for CS1 Scala is significantly simpler than Java because it is more completely object-oriented and has a more regular syntax. One of the first examples of this that comes up in CS1 is what happens when you have to convert types.

Anyone who has taught CS1 knows that it doesn't take long before you run into situations where your students have one type and they need another. If they have an Int and need a Double everything is good. However, if they have a Double and need an Int then things aren't so simple. Even worse, they get a value as a String and they need a numeric type. We can illustrate the problem here using Java. If you have a double and need an int then you need to do a narrowing cast.
double estimate = Math.sqrt(input);
functionThatNeedsAnInt((int)estimate);
The first time you hit this in your course you have to take a little detour to describe a completely new piece of syntax. Students won't see many other uses of this syntax for a long time after this first introduction, and hopefully when you actually get to inheritance hierarchies you spend some time telling students that narrowing casts are risky and lead to runtime errors if they aren't guarded with a check of instanceof.

The real problem here is that Java has primitive types. When Java was created this was a smart thing to include. Just look at the performance of Smalltalk to see what happened when basic numeric types were treated as objects in the mid-90s. However, primitives add a lot of complexity to the language, including many details that students have to struggle with. Autoboxing hides some of the details associated with primitives in Java, but that doesn't mean students are freed from understanding them. If anything, it just makes it harder for students to really understand.

The following week you run into the situation where students have a String and need a numeric type. So students naturally go for the obvious solution of a type cast with code like (int)str. Only this doesn't work. So now you get to introduce something else new, in this case Integer.parseInt(str). You can either present this as a complete black box or you can choose to spend some time talking about the wrapper classes and the static methods in them that can help you do things with primitives.

As promised, this is much simpler in Scala, in large part because everything is treated as an object at the language level. Yes, primitives still compile down to something different on the JVM to maintain speed, but that is an implementation detail, not something that matters to CS1 students who are trying to solve simple problems on a computer. In Scala you get to have code like this.
val estimate = math.sqrt(input)
functionThatNeedsAnInt(estimate.toInt)
println("Enter a number or quit")
val str = readLine()
if(str!="quit") {
  functionThatNeedsAnInt(str.toInt)
}
The use of methods like toInt is standard and uniform. You have the toString you are used to in Java, but it works on Int and Double too. You can also do other conversions that make sense, such as toDouble from a String or toChar from an Int. Once you hit collections, students find the use of toList and toArray to be perfectly natural.
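A quick sketch of these conversions in action (the values are picked arbitrarily):

```scala
println(3.7.toInt)        // prints 3 (truncates toward zero)
println(42.toDouble)      // prints 42.0
println("64".toDouble)    // prints 64.0
println(65.toChar)        // prints A (the character with code 65)
println((1 to 3).toList)  // prints List(1, 2, 3)
```

Every conversion uses the same toXxx naming, which is exactly the regularity the novice benefits from.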

One final point on primitives and the lack of them in Scala is the type names themselves. It seems like a small issue to capitalize types like int, but when you are working with the novice programmer, having to describe why int is lower case and String is capitalized actually matters. What is more, when you get to the point where students are creating their own types it is much easier to get them to adhere to the standard if every type they have ever seen starts with a capital letter. Telling them that variable and method names start lower case while type names start uppercase is much nicer if the language itself doesn't break that simple rule.

Getting rid of primitives isn't the only enhancement to the syntax that makes Scala work well for CS1. Scala lets you nest anything anywhere. The rules of Java, which can seem rather arbitrary to a novice programmer, aren't an issue in Scala. Helper functions can be nested inside of the function they help just as an if can be nested in another if. When you get to classes later on the same type of freedom applies there as well.
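As a small illustration of that freedom, here is a sketch of a helper function nested inside the function it supports (the names are made up for the example):

```scala
def sumOfSquares(nums: List[Int]): Int = {
  // square is visible only inside sumOfSquares, so it can't clutter
  // or conflict with anything in the outer scope.
  def square(n: Int): Int = n * n
  nums.map(square).sum
}
println(sumOfSquares(List(1, 2, 3)))  // prints 14
```

In Java the helper would have to be hoisted out to be a separate method of the class, a rule that novices find arbitrary.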

Another change in the language that I have found to really help in CS1 and CS2 is the removal of the static keyword. My personal experience with CS2 was that static was a concept that students really struggled with. The fact that this confusing keyword had to appear on the main method of their first program didn't help. Scala does away with the static keyword and instead provides object declarations that create singleton objects in the scope they are declared in. Since students are used to interacting with objects, this makes much more sense to them in CS1 when I can just say they are calling a method on an object instead of saying that they are calling a static method, like sqrt or parseInt, on a class. Indeed, I don't even mention the word "class" in CS1 until we start bundling data together with case classes. Before that it is the objects that are important, not the classes.
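A sketch of what this looks like in practice (TemperatureConverter is an invented name for the example):

```scala
// An object declaration creates a single instance; no static keyword needed.
object TemperatureConverter {
  def toFahrenheit(celsius: Double): Double = celsius * 9 / 5 + 32
}
// Students just call a method on an object, the same as anything else.
println(TemperatureConverter.toFahrenheit(100.0))  // prints 212.0
```

There is no new concept here beyond "objects have methods," which students already use constantly.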

To be fair, there is one minor challenge with the object declaration in Scala, and that is just one of terminology. It is challenging, when talking to students, to differentiate between usages of the word "object".

The last syntactic change that I have noticed, which I find helps make things more regular in Scala, is that every declaration begins with a keyword. The code samples above show that val can be used to declare values. There is also a var keyword that creates what you normally think of as a variable. (A val is like a final variable in Java.) Functions and methods are declared using the def keyword. There is a type keyword for giving shorter names to types, as well as class, trait, and object keywords for declaring those constructs. Again, the benefit here is regularity of syntax. It is easy to tell students that declarations start with the appropriate keyword, and students can easily look at a declaration and know from the keyword what type of thing is being declared.
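Put together, a short sketch showing one of each declaration form:

```scala
val pi = 3.14159                          // val: a binding that cannot be reassigned
var count = 0                             // var: an ordinary mutable variable
def circleArea(r: Double) = pi * r * r    // def: a function or method
type Point = (Double, Double)             // type: a shorter name for an existing type
count += 1                                // reassignment is fine for a var
println(circleArea(1.0))                  // prints 3.14159
```

Every declaration is announced by its keyword, so a student scanning code can identify each one at a glance.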

It is worth noting with val and var that Scala is making it easier to get students to use good style. In Java you really should make variables that don't change final. However, getting students to type those extra keystrokes is a bit challenging. I find it works much better to tell students to use val by default and only change things to a var if it is actually needed. Not all students will follow this recommendation. Some will get lazy and just make everything a var, but the percentage who do so is much smaller than if you are requesting the final keyword in Java.

Functional Aspects

This is where it is possible to lose a lot of instructors. Yes, the functional paradigm is not as widespread as the imperative paradigm. However, that is largely a historical artifact and something that is changing. After all, the newest versions of both C++ (C++11) and Java (Java 8) include lambda expressions as a language feature. So the reality is that teachers are going to have to deal with some aspects of functional programming at some point. Why not do it in a language that had those features from the beginning instead of one where they were tacked on later? I promise you, the former option leads to a more regular syntax and usage than the latter.

As was mentioned in the last post, I don't teach my Scala class from a purely functional perspective. It wouldn't be hard to do so. Simply skip the section on var and don't show the students the Array type, which is mutable, and what you are left with is very functional. However, I agree with the developers of the language that functional is great when it is great, but that there are places where it just makes more sense for things to be imperative. For that reason, I use functional and imperative aspects side-by-side. I highlight when things are mutable and when they aren't, something I already did in Java to explain why toUpperCase on a String didn't alter the original value. I use whichever approach makes the most sense for whatever we are doing. When neither approach is distinctly better I often ask students to do things multiple ways just to make sure that they understand the different approaches.

The first place where having functional programming really alters what I teach in CS1 is when I introduce lambda expressions/function literals right after talking about normal functions. We use this early on to make our functions more powerful. I don't abstract over types in CS1, but I can abstract functionality by passing in a function as an argument and I do spend time showing them this.
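For instance, a function that abstracts "combine the numbers 1 through n" over whatever operation is passed in might look like this sketch (the names are invented for illustration):

```scala
// op is a function argument: the caller decides what "combine" means.
def combine(n: Int, start: Int, op: (Int, Int) => Int): Int = {
  var result = start
  for (i <- 1 to n) result = op(result, i)
  result
}
println(combine(4, 0, (a, b) => a + b))  // prints 10 (sum of 1 to 4)
println(combine(4, 1, _ * _))            // prints 24 (product of 1 to 4)
```

One function definition now covers sums, products, and anything else, which is exactly the kind of power I want students to taste early.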

Where the lambda expressions really come into play is when I start dealing with collections. I teach both Arrays and Lists in CS1. One huge change from previous languages is that in Scala I teach these collections before loops. In Java or C there is only so much you can do with an array if you don't have loops. However, the higher order methods in Scala and the fact that the for loop is really a foreach allows collections to not only stand on their own, but to really belong before loops.

The power of the collections comes largely from the higher-order methods that take functions as arguments. The map and filter methods are the ones used most frequently, but there are many others available that make it easy to do fairly complex tasks with collections. There are also some more simple methods like sum that come in very handy for many of the simpler examples that are common for CS1. One of the standard examples of this that I give is the problem of finding all the minors in a list of people. Let's start with the Java code for doing this.
List<Person> minors = new ArrayList<Person>();
for (Person p : people) {
  if (p.age < 18) minors.add(p);
}
This code assumes that you are far enough along to be using java.util.ArrayList. If you are still working with just the built-in arrays, you need something a bit longer.
int count = 0;
for (Person p : people) {
  if (p.age < 18) count++;
}
Person[] minors = new Person[count];
int i = 0;
for (Person p : people) {
  if (p.age < 18) {
    minors[i] = p;
    i++;
  }
}
I could have taken out a line by reusing the count variable instead of introducing i as a variable, but such variable reuse is a risky habit to get into.

So how does this look in Scala?
val minors = people.filter(_.age < 18) // could be p => p.age < 18
Now I have had some people try to convince me that this is harder to read. I would argue that is a clear indication that they aren't familiar with filter, because once you understand what filter means, the meaning of the Scala code is immediate. Once you feel comfortable with filter you would read this line of code as "declare minors with the value of the elements of people whose age is less than 18." What is more important in many situations is that the Scala code is also far less bug prone, and is especially resistant to logic errors. There aren't many things you can change in the Scala code that won't lead to a syntax error. Changing the < to a > or typing in the wrong number are about the only possibilities and both languages suffer from those.

You might wonder what the type of minors is in the Scala code. The answer is that it is whatever collection type people was to start with. If you were unhappy with that you could use methods like toArray or toList to convert it to the collection type you really wanted.
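A small sketch of that behavior, with Person as a minimal stand-in for whatever type the course is using:

```scala
case class Person(name: String, age: Int)

val people = List(Person("Ann", 12), Person("Bob", 30), Person("Cat", 15))
val minors = people.filter(_.age < 18)  // still a List[Person], because people was a List
println(minors.map(_.name))             // prints List(Ann, Cat)
val minorsArray = minors.toArray        // converted only when an Array is wanted
```

The result type follows the input type, so students rarely have to think about it at all.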

Now what if we wanted to find the average age of those minors? We can start again with the Java code.
int sum = 0;
for (Person m : minors) {
  sum += m.age;
}
double averageAge = (double)sum/minors.length; // or .size() for the ArrayList
Note the cast to a double to prevent this code from doing integer division. So what about in Scala?
val averageAge = minors.map(_.age).sum.toDouble/minors.length
// or
val averageAge2 = minors.foldLeft(0.0)((sum,m) => sum + m.age)/minors.length
// or
val averageAge3 = minors.foldLeft(0.0)(_ + _.age)/minors.length
I have presented three possibilities to illustrate some options. I expect my CS1 students to write the first version, where we use a call to map to pull out all of the age values. This version is less efficient because it actually creates a new collection. The other two are more equivalent to the Java code in how they run, but I have to be honest and say that students often do find the fold methods to be more difficult to reason about.

If one were looking for something to complain about, you might argue that the Java versions are teaching the students more about problem solving as they force them to break things down further. While I'm not at all convinced that this is true, I would certainly note that you can write code just like the Java version using Scala. The difference is that in Scala you don't have to. So I would show my students both approaches in Scala and expect that most will pick the more Scala-like option. The exceptions to that rule would likely be students who had previous experience in Java.

I also believe that at some level CS1 students can only handle a certain spread between the magnitude of the problems they are solving and the smallest details that they have to go down to. So if you force them to go to a lower level of detail in their solutions you aren't so much forcing them to learn more problem solving, but instead constraining the scope of the problems that you can ask them to solve.

There are a few other things I would want to comment on related to CS1, but this post is already getting to the TL;DR length so I'm going to stop this one here and save the rest for a later post. Your comments and thoughts are welcomed.

Thursday, November 28, 2013

Why Scala for CS1 and CS2?

+Thomas Lockney pointed me to a Tweet by +martin odersky today with a link to the following blog post: http://teaching.software-carpentry.org/2013/11/28/choosing-scala/. This has prompted me to write a bit about my own experience with teaching CS1 and CS2 using Scala. In this first post I will focus on why we chose Scala and continue to use it. In a second post I will run through what I see as the most significant benefits of Scala, especially in CS1.

I got the Odersky, et al. book and started learning about Scala in 2009 because I was working with students on some projects that involved the experimental X10 and Fortress languages and they both borrowed ideas from Scala. Indeed, the big push for me was that X10 switched from having a Java-based syntax to a Scala-based syntax around version 1.7. Since these experimental languages don't have learning guides, just language specs, I decided that I probably needed to actually learn Scala from a useful book before I tried to tackle X10 in its new form.

I will admit that for the first 200 pages or so I was very skeptical. There were a number of language decisions that simply didn't make sense to me. At that point they seemed ad hoc. However, by the time I was half way through the book I was hooked. I had loved programming in ML a few years prior, but I couldn't do much in it because of the lack of library support. Scala brought me many of the things I loved from ML and the full Java libraries that I had years of experience with. Even then though I was thinking about using it in upper division classes and I recall saying to myself that it couldn't work for introductory courses.

Around this time our department also started evaluating our curriculum to make sure that everything was in order with ACM guidelines and to try to patch issues that we felt existed in how we were doing things. In Fall 2002 we had adopted an approach where we taught CS1 with no OO in C, CS2 with heavy OO in Java, and our Data Structures course using C++. In many ways we were happy with this setup, but there was one glaring issue. The breadth of languages had many benefits, but at the end of the day, students didn't feel comfortable in any single language. For that reason, we wanted to change things so that CS1 and CS2 were taught in one language while Data Structures and a new course on Algorithms were taught in another. It was fairly easy to pick C++ for the 3rd and 4th semesters. (I'm sure there could be endless discussion on that and comments are welcomed, but I'll just say here that it works well for us and we believe it works well for the students.) However, the choice for CS1 and CS2 was more challenging.

We were happy with focusing on logic in the first semester and holding off real OO until the second semester. After all, OO really shines on large projects and CS1 is not about writing long programs. CS1 is about figuring out how to break down problems and construct logic in the formal terms of a programming language. I'm also not a fan of using tools explicitly designed for education in majors courses. Tools like BlueJ, Greenfoot, and DrJava can be great for getting around some of the problem areas of Java early on and we are happy using them for non-majors, but we would rather work with "real" tools with majors.

We also had to consider the requirements of CS2. This isn't just a class about OO. It is also about abstraction and polymorphism. We want students to understand things like inheritance hierarchies and parametric polymorphism (generics/template/type parameters). We think it is nice to throw in things like GUIs, graphics, multithreading, and networking along the way in CS1 and CS2 to allow students to build applications they find more interesting and because we want to make sure those things are seen by everyone, even if they skip one or two of those topics in their upper division courses.

We had pretty much ruled out Java because of the tool issue and the fact that the language just doesn't support programming in the small well on its own. The keyword soup that is "Hello World" in Java wasn't an option for us. In many ways, Python was the early front runner at Trinity, as it has become in many other departments looking to change the early curriculum. Python works wonderfully for CS1 and has a feature set that fits well with what we want to teach in that course. Unfortunately, Python doesn't fit our learning objectives for CS2. The OO isn't very pure, and many of the topics we want to cover really only make sense in a statically typed language.

It was during these discussions about languages that I had my Scala educational epiphany. I realized that many of the things that made Python great for CS1 were also features of Scala. The availability of a REPL and a scripting environment meant that Scala was just as good for writing small programs and testing things out as Python or other scripting languages. Unlike Python though, Scala was even better than Java at handling the OO and related topics in CS2.

What we decided to do was run an informal experiment. For 1.5 years we ran sections using our old approach, Python, and Scala to see if there were any glaring holes. While everyone felt that each style did a reasonable job, there was some anecdotal evidence that students who started in Python had a slightly harder time moving to C/C++ because of little things like having to declare variables and worry about types. This isn't something that would be likely to be seen in final outcomes from the major, but it is the type of thing that can slow the class down the next semester and prevent them from getting quite the same depth or breadth.

In many ways, the decision was made to go with Scala because the primary supporter of Python retired and I didn't. We are now in the first half of the 4th year using Scala and the 2nd full year where all sections are using Scala. In general, I think the decision to go with Scala has served us well. In my next blog post, I will go into the details of how I use Scala in CS1 and the features that I think are most useful for that course.

Saturday, August 31, 2013

import != #include

I spent my summer doing software development for a local company and most of what I did was structural work on a large C++ code base with a long history. Since I am teaching a data structures course this coming fall that will use C++, I figured it would be good experience. It certainly gives me some stories to tell, but it has also helped to bring into sharper relief the differences between C++ and Java (as well as all the languages that have been influenced by Java).

I will write another post giving more of my thoughts on C++11, but I should mention right off the bat here that I think C++11 has a lot of really cool features that greatly improve the language. Developing modern code in C++ is not a bad thing, but the legacy of C++ means that not all code written today has to use the modern style, and worse, much of the code currently in existence doesn't use these features at all. Plus, even with the improvements, C++ still uses pretty much the same tool chain as C, and that is a bit of a problem.

How They Differ

I've always known that import and #include do different things. I make sure to point this out to students any time I am teaching a class where I can compare languages that use these two different features. However, working on a large C++ code base made the difference really stick out. What I came to realize is that #include causes a problem because it impacts the structure of code. This isn't an issue with import because it does nothing more than tell the compiler how to resolve short names in a source file into fully specified names.

The difference becomes more obvious when you run into situations where the order of #includes in a source file is important. I have to point out that if one follows all the normal best practices this never happens. Unfortunately, not everyone has followed these practices. In particular, there are Windows libraries that do things which break the rules and cause the order of includes to matter. The Windows libraries also have odd behaviors where you aren't allowed to include some files directly because they depend on other things being defined first. This can be particularly challenging when you are working on a project and the IDE tools do a great job of telling you exactly what header file defines something you need, but you aren't allowed to include that file because Microsoft did something non-standard in building their libraries. (This comes up a lot with their typedefs of BOOL, TRUE, and FALSE. That topic is probably worth a whole blog post to rant about.)

In some ways, I feel that the real problem is that header files can, and often must, include other header files. Because of this, putting a single #include in a source file can result in 100 or more other headers being included. Mix in a few #ifdef or #ifndef directives and things quickly become a complete mess where order matters a lot.

How This Happens

Now it is easy to throw stones at previous developers (including those for Windows) and say they just didn't know what they were doing. Inevitably there are situations where previous developers made some poor decisions that led to structural code problems in the headers. However, many of these things can creep into code over time and maintenance programmers can easily add them in not realizing what they are doing. The reason is that some flaws in code related to #includes and headers are hard to track down unless you have a powerful static analysis tool to help you. For example, files should #include all the things that they use and not things that they don't. Sounds like a simple enough rule to follow when you are the original author of a file. The compiler won't like your code if you don't #include things you are using, and unless you just have a bad habit of adding lots of #includes at the top of every file because you "might use them" you aren't going to put in extra stuff.

However, even with the original author there can be some challenges if you rely on the compiler to tell you when you are including everything that you need. If you have one header file that includes a lot of others, it is possible you might include that one file and forget to include the others directly even though you use things in them. This doesn't sound like a problem until you, or someone else, makes a change in what that one header file includes and your source files break because it wasn't doing its own includes directly. Relying on one file to do things for you that aren't really part of its job description is generally a great way to give yourself headaches later on.

When you consider the situation of the maintenance programmer, things get much worse, especially if the code was a bit smelly to start with. It is easy to add code and just say that if it compiles everything is happy. It takes time and effort to go to the top of the file and see if there is already a #include for the things you just added in. The time and effort grow if the file is longer than it should be. Not only do you have to jump farther from your current editing point to check, but the length of the #include list generally grows as well.

The problem is even worse when you are deleting a function, method, or even a few lines of code. Figuring out whether you just deleted the only reference to something in a particular #include is not a trivial task. As a result, #includes, once added, are unlikely to ever go away until someone decides to spend some real time doing cleaning, or unless you have a static analysis tool powerful enough to tell you that some particular #include is no longer needed.

It Made Sense for C

So why was such an odd system put into place to begin with? Well, it made sense for C, which ran on limited systems and very strictly followed the requirement of everything happening in a single pass. There were lots of hardware reasons why the original C compilers needed to go through their programs in a single pass from top to bottom and not store lots of extra tables for things along the way. When your program is stored on a tape (or punch cards) and your machine has limited memory, you don't want to have to run through the source multiple times in the process of compiling.

What changed?

Of course, most of those reasons are completely moot on a modern machine. This is why we have seen a march of programming languages that move more and more of the work onto the compiler. Focusing on Java, the import statement doesn't actually do anything to the code; it just tells the compiler how to resolve short names into the longer, fully qualified names that they represent. (Honestly, import is really the equivalent of using in C++, not #include.)

Faster machines with more memory and the fact that you were never compiling something stored on a tape made multiple passes and random access far more acceptable. So you don't have any need to have the preprocessor spit out something that can be handled in one pass from top to bottom. You don't mind if the compiler has to go looking in separate files. In fact, that can be faster. The whole idea of precompiled headers in C and C++ only exists because opening and processing the same header files over and over for every source file in a large project can really slow things down. Losing the concept of a header file removes that overhead from the compiler. (I also really appreciate that it removes code duplication. Having to touch two files any time a function signature is adjusted has always annoyed me.)

Making import work

In Java, a significant piece of what made this workable on the computers available in the mid 90s was a set of very specific rules that had to be followed relating file names and directories to packages. The fully qualified name in Java tells the compiler exactly where it should go looking for something. So all the compiler has to do is figure out the fully qualified name, and that is exactly what import statements do in the code: they allow short names to be turned into fully qualified names.

Newer languages have relaxed some of Java's strict naming rules and have put even more burden on the compiler. Scala comes to mind as a key example of this. It compiles to the JVM, and the .class files are placed in the directory structure required by the JVM, but the compiler actually takes its cues from the source code. Of course, it is generally recommended that you follow the Java scheme because it makes it easier for programmers to find where things are as well.
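The point that an import is purely a name-resolution device is easy to demonstrate in a Scala script or REPL session. The sketch below (using ArrayBuffer as an arbitrary example) shows that the imported short name and the fully qualified name refer to exactly the same thing, and that Scala's imports go a bit beyond Java's by allowing local scoping and renaming:

```scala
import scala.collection.mutable.ArrayBuffer

// The import only teaches the compiler how to expand the short name;
// the fully qualified name works with no import at all.
val short = ArrayBuffer(1, 2, 3)
val long  = scala.collection.mutable.ArrayBuffer(1, 2, 3)
assert(short == long)

// Scala imports can also rename, which Java's import cannot do.
import scala.collection.mutable.{ArrayBuffer => Buf}
val renamed = Buf(1, 2, 3)
assert(renamed == short)
```

Nothing about the generated code changes between the two forms; only the compiler's lookup of the name differs.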

Conclusion

My main conclusion from all of this was that the decision to leave behind #include semantics and switch to import semantics was a huge step forward in programming. I know that I first saw import in the Java language, but I expect that it dates back to something earlier. Perhaps someone can leave a comment on the origin of import semantics.

Tuesday, August 27, 2013

Why does Klout undervalue Google+?

Let me start off by saying that I'm a fan of Klout. I think that they provide a very interesting metric of social network activity. I like the idea of an attention-based economy, and I think that Klout is a potential model for how that could begin. Plus there is the Scala factor. Given that I am a Scala zealot, I love that Klout not only uses Scala but that they write about it in their engineering blog. I think they give great publicity to the language. I am also very much looking forward to having Klout include other networks in their scores. Specifically, I would love to see the inclusion of YouTube and Blogger.

Having said all of that, there is one thing that has been really bothering me. I feel that Klout dramatically undervalues Google+. To illustrate this, we can look at my profile and the network breakdown on Klout.


You can see that Klout says that most of my score comes from Facebook followed by Twitter and then Google+ right behind that. They also provide some summary information for what they base this on including friends, followers, and interactions on the different networks.

One of the things that Klout isn't showing under the Network information is how many people have me in circles on Google+. I find this very odd given that they do list the number of friends for Facebook and the number of followers on Twitter. So you can see in the figure above that I have fewer than 200 Twitter followers and under 600 Facebook friends. To compare this to Google+, I've included my CircleCount page here.


You can see from this that I have over 8,500 followers on Google+. I'll be the first to admit that I don't have a lot of engagement with the vast majority of those followers, but I typically get 10+ notifications of some type of engagement every day. I probably get a bit more engagement on Facebook, but I have much, much less on Twitter.

So my question is, why does Klout say that Twitter is slightly more significant than Google+ and that Facebook is more than 6x as significant? Is there something about the API that Google provides that prevents them from getting the needed data to consider Google+ appropriately or is their formula simply messed up? I'm hoping someone working at Klout might see this and comment. Maybe even look into it and consider tweaking that part of their formula in the next major revision. I realize it is very hard to compare across networks and that probably leads to differences, but this seems a bit extreme to me. I also feel for the Klout developers as they look at how to integrate Blogger, YouTube, and others.

Tuesday, March 19, 2013

Why coding isn't like working a printing press

Yesterday I was in a meeting with other STEM faculty discussing the proposed new curriculum for Trinity. Toward the end of the meeting I turned the discussion toward the Digital Literacy Competency aspect of the proposal. I was on the committee that created the original version of that proposal, and I felt that it had been watered down a bit. In particular, the term "algorithm" had been removed. The goal that I have for that part of the curriculum is to make it so that every student graduating from Trinity has written a little code to solve a problem. There are lots of places where people have argued for the importance of this. Some of my favorites are the blog post "Should you learn to code?" and the videos at Code.org. (If you haven't seen these PLEASE take a minute to look at them. Technically it will be 5 minutes for the normal version of the Code.org video.)

What really surprised me at this meeting was how many of the STEM faculty were opposed to this idea. I used the analogy of code being a new form of literacy, and a few people commented that they didn't know how to use a printing press. Later that night I realized what I should have said in response to illustrate that coding is nothing like being the person who runs a printing press. Here are some reasons. First, you aren't surrounded by printing presses. You are surrounded by computing devices, just like you are surrounded by books. Second, running a program someone else wrote is, at best, like reading a book. You are consuming what someone else produced. I don't think any of my peers would argue that we would be happy having students who can only read and consume written information but are incapable of writing their own. The goal of this capacity is to give students their first glimpses of how to produce in the digital world. If you have never seen programming, you are stuck as a consumer. You can't produce anything on the devices that surround you every day.

Note that I'm not asking that every student be able to write the equivalent of a novel. We don't do that in English either. In fact, I'm not even asking that they be able to write the equivalent of a five-page essay. I'm asking that they have the experience of writing a little script to solve some problem. I would argue that is similar to asking that they know how to compose a few sentences, much as kids do in early elementary school. In fact, there is a growing movement to have students doing this at the elementary school level, and the US is behind the curve in this area (see Code.org and http://neil.fraser.name/news/2013/03/16/).

Why do I think this is so important? There are several reasons. We are supposed to be preparing students to live in the 21st century. Most of our students already carry a computing device, in the form of a smart phone, with them more than they do books, yet they are completely incapable of creating any level of content on that device. They are literally marginalized by that inability.

In addition, programming is all about problem solving. Honestly, I think that the formalized thinking of programming is more important for students to learn today than even a foreign language. Both have cognitive advantages to those who learn them, but the way the world is moving, I believe that the thinking skills that go into programming are the more significant of the two.

Going further with the topic of problem solving, the reality is that more and more problems are becoming solvable only with the use of a computer. For the most part this is happening because the data sets involved in solving these problems have gotten to the point where you simply can't manipulate them by hand. This isn't just true in the sciences; big data sets are now the norm for the social sciences and many humanities as well. The argument was raised that students can just use existing programs to solve those problems. To my mind, that's like saying that students don't need to know how to write because they can just copy and paste the ideas that other people have written. Also, just as with writing, I would argue that we want our students solving novel problems; eventually they will want to say something that hasn't been said before. If that happens and they don't know how to write, they are stuck. The same thing applies to code. If you need to solve something and can't find existing programs for it, you either write it yourself or you are stuck. Knowing how to code gives you the advantage of at least being able to consider writing it yourself.

Once again, I'm definitely not arguing that everyone be a CS major. In fact, I don't even want these courses taught in CS. I think that CS will have to help other faculty with them, especially early on, but I think that this will be most effective if it is done in the context of students using computers to solve problems in other domains that they are truly interested in. I made the analogy to reading and writing above, but an analogy to math works just as well. We want every one of our students to know algebra. Why? Because there are many problems that are easy to solve if you know algebra and much harder if you don't. We don't want to teach algebra so that every student will be a math major. Math majors do many things that aren't even touched upon in algebra courses. In this situation, the algebra is a tool that is useful to people. As computing devices become more widespread and the data sets that are significant for problems in all fields grow, the significance of programming increases as well. Writing a script becomes just as essential a capability as writing a short letter or solving a little equation. At least that is how it seems to me. Comment with your own thoughts.

Addendum: Aaron Delwich pointed out how important it is that the computer is a universal machine. A printing press prints books, nothing else. It is specialized. One of my colleagues at the meeting also mentioned a car, saying he could argue that we should teach all students about internal combustion engines, but cars are also very special purpose. The computer is universal. It can do any operation you want with information. That is why they are used in every field to solve all kinds of problems and why learning about them is not the same as learning to operate a printing press or how an engine works.

Thursday, February 21, 2013

Are the Humanities Killing Themselves?

This blog post, as with many of the others I have written recently, focuses on the efforts of some to change the standard student workload at Trinity to 4 courses/semester each counting for 4 credits (a 4-4 load) instead of the current 5 courses of 3 credits. This proposal normally also has attached to it a change in teaching load so that faculty teach 3 courses one semester and 2 the next instead of 3 courses every semester [*].  This change will have a lot of implications, but one interesting point to note is that, if I paint with overly broad strokes, most of the support for this comes from the humanities and social sciences, while most of the opposition comes from the STEM and pre-professional departments [**]. Something I have discussed with others, and which I truly believe to be the case, is that the departments supporting this move, along with a proposed change to our common curriculum, do so at their own peril. I want to elaborate on this idea here in part to record it and perhaps to bring it forward for broader discussions.

The move to the 4-4, and the new curriculum proposal that will come with it, have one inevitable consequence: students will take a smaller number of courses and will get less diversity in the departments they are exposed to. The 4-4 makes the first point unavoidable. The second point happens because our current breadth requirements would be impossible to maintain under the 4-4, and the new proposal, which I should point out I support, does not require students to take courses in as many different departments.

So why do I think that this should bother the faculty in the humanities? Well, to pick on one particular major, how many people enter college saying they want to major in Religion? Of course, Religion isn't alone. There are a number of majors that only attract students by having them take introductory courses and fall in love with the topic. Some are even in STEM; the Geoscience department comes to mind as an example. This alone should probably give those departments second thoughts, but I am sure they can easily say that it will improve their departments in other ways, so it is worth it.

What I feel they ignore though is the changing national view on the role of higher education. More pressure is being put on the idea that students should major in something that will get them jobs. This goes along with the complaints that students are leaving college with too much debt and then they aren't able to find jobs. This is why we see proposed legislation that will force colleges to report earnings of graduates by major.

I am personally already feeling some of the fallout of the change in how students are picking majors. I have 35 students in my CS2 courses at Trinity. We haven't had that many students go on to the second semester in a decade. CS is one of the few areas where people hear about there being lots of jobs and not enough people to fill them these days. It might still be true that no one goes to college thinking they want to major in Geoscience, but if they look at the incomes associated with that major they can probably be convinced.

What about other departments like Religion and History? I expect they are going to take a hit from this in the next few years as more and more students enter college thinking about their earning potential. Even if the Wyden-Rubio legislation does not pass, they have access to things like the "What's it Worth?" report from Georgetown. They will also find things like this Forbes article entitled "The 10 Worst College Majors", which slams most humanities majors for high unemployment and low incomes. So why would these departments support changes that are likely to reduce the number of students coming into their classrooms and hence their ability to attract majors? Obviously I don't have an answer to that because it makes no sense to me, but I want to dig a bit deeper to make it clear what could be at stake for these departments.

Let's pick the two departments at the top of the "What's it Worth?" salary survey, along with some humanities departments for comparison, and look at faculty counts.

Engineering Science - 9
Computer Science - 7
Art and Art History - 12
History - 11
Religion - 9

Right now the humanities departments have as many or more faculty than the majors that will inevitably be getting a lot more attention if students and parents really start looking at ROI on a college education. (I know that liberal arts are all about the additional benefits of being broadly educated, but I promise you that parents are not blind to the dollar signs when they are looking at colleges for their kids. As such, this can't be ignored no matter how good the arguments are that some things matter more than money.) So the move to the 4-4 and the new curriculum will contract the distribution requirement and pull students out of almost every department, making major counts more important. Then, if we really do get a push toward majors that lead to jobs it seems to me that a lot of the humanities departments might have a hard time filling their classrooms and justifying their current faculty counts.

Note that I'm not wishing this on them or saying this is a good thing. Quite the opposite. By opposing the 4-4, I'm pushing to keep students taking a broader distribution of courses and make it so that even if more students choose to major in the sciences, the humanities will retain reasonable enrollments. I'm writing this because I don't think most of the humanities faculty see it that way and I'm wondering what they expect to see happen with their total head count in the next five years if they pass the 4-4 and the new curriculum.

[*] This isn't really a drop in teaching load. The current system has faculty teaching a total of 18 credit hours each year. The altered system would have them teaching 20.

[**] This is an overly broad generalization. My guess is that this post will be read by at least one Communications professor who doesn't favor the proposal. It is also problematic for Music. In addition, I'm sure a few STEM faculty are supportive. However, in general this dividing line holds.

Wednesday, January 2, 2013

Loop Performance and Local Variables in Scala

This post was motivated by Local variables inside a loop and performance. I have to admit that the results of that post didn't really surprise me. For the code that they had written I felt like any reasonable compiler should have made the stack frame for the whole function big enough to hold those variables so that there was no additional work done for local declaration inside of a loop. What interested me though was whether the same would be true in Scala.

Why Scala Might Be Different
The for loop in Java is just a dressed-up while loop. It is useful because it gives you places to put all the main components of a loop so that you are less likely to forget them. However, when it is converted to the assembly level (or bytecode level) it should have a conditional branch at the top and an unconditional branch at the bottom with little additional overhead. In Scala, however, the for loop is better described by the name "for comprehension". In a sense, there is no for loop in Scala. Instead, uses of for get translated to calls to higher-order methods like map, flatMap, filter, and foreach. As an example, consider this for loop in Scala.

for (i <- 0 until runs) {
  val x = i % 12
  val y = i / 12 % 12
  val times = x * y
  counters(times) += 1
}

This gets translated to the following:

(0 until runs).foreach(i => {
  val x = i % 12
  val y = i / 12 % 12
  val times = x * y
  counters(times) += 1
})

This very simple example doesn't really show the power that is gained by this translation, but the capabilities of for comprehensions really do add a lot to Scala. Some of these features include simple data parallelism with parallel collections and concise syntax for Futures in the Akka framework. However, this greater power does come with some overhead in terms of performance. This change also makes it less clear to me whether local variables in for loops would be significant.
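To see a bit more of the comprehension machinery at work, consider a for comprehension with a guard and a yield. The sketch below shows roughly what it desugars to; this is an illustration of the translation scheme, not the exact code the compiler emits:

```scala
// A for comprehension with a guard and a yield...
val squaresOfEvens = for (i <- 1 to 10 if i % 2 == 0) yield i * i

// ...is translated to calls to higher-order methods, roughly like this:
val desugared = (1 to 10).withFilter(_ % 2 == 0).map(i => i * i)

// Both forms build the same collection.
assert(squaresOfEvens == desugared)
```

Each extra generator, guard, or yield becomes another flatMap, withFilter, or map call, which is where both the flexibility and the extra overhead come from.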

Test Code
The following code shows how I adapted the Java from Peter's blog post to Scala. The initial calls to the test functions are there to let the JIT compiler warm up on the code before any timing is done. I also went a bit further and had each function call happen multiple times so that I could get a standard deviation in addition to average values. There are two functions using the for loop and two using the while loop.

object LocalVars {
  def main(args:Array[String]) {
    testInsideForLoop
    testOutsideForLoop
    testInsideWhileLoop
    testOutsideWhileLoop
    printResults("In for loop:",Array.fill(10000)(testInsideForLoop))
    printResults("Out of for loop:",Array.fill(10000)(testOutsideForLoop))
    printResults("In while loop:",Array.fill(10000)(testInsideWhileLoop))
    printResults("Out of while loop:",Array.fill(10000)(testOutsideWhileLoop))
  }

  def printResults(header:String,results:Array[Double]) {
    println(header)
    val average = results.sum/results.length
    printf("Average = %.3f ns\n", average)
    val aveSqr = results.map(x => x*x).sum/results.length
    printf("Standard dev = %.3f ns\n\n", math.sqrt(aveSqr-average*average))
  }

  def testInsideForLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    for (i <- 0 until runs) {
      val x = i % 12
      val y = i / 12 % 12
      val times = x * y
      counters(times) += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testOutsideForLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var x, y, times = 0
    for (i <- 0 until runs) {
      x = i % 12
      y = i / 12 % 12
      times = x * y
      counters(times) += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testInsideWhileLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var i = 0
    while (i < runs) {
      val x = i % 12
      val y = i / 12 % 12
      val times = x * y
      counters(times) += 1
      i += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }

  def testOutsideWhileLoop:Double = {
    val start = System.nanoTime()
    val counters = Array.fill(144)(0)
    val runs = 1000 * 1000
    var x, y, times, i = 0
    while (i < runs) {
      x = i % 12
      y = i / 12 % 12
      times = x * y
      counters(times) += 1
      i += 1
    }
    (System.nanoTime() - start).toDouble / runs
  }
}

Results
I compiled the above code using Scala 2.10.0-RC5 with the -optimise option. When I ran it I got the following output.

In for loop:
Average = 6.073 ns
Standard dev = 0.448 ns

Out of for loop:
Average = 6.084 ns
Standard dev = 0.507 ns

In while loop:
Average = 4.670 ns
Standard dev = 0.322 ns

Out of while loop:
Average = 4.669 ns
Standard dev = 0.320 ns


It is clear that where you declare local variables doesn't matter at all, as all variations are inside the error bars for the measurement. However, there is a noticeable performance penalty for using the for loop. The ScalaCL project provides one approach to getting around this. Since ScalaCL is not up to date with 2.10, I didn't test it out here. Earlier testing under 2.8 showed me that it generally did bring the performance of for loops in line with while loops. I would love to see the ScalaCL project get more support, both for this reason and to improve the OpenCL collections.

More Questions
I have two problems with what is done here. First, my Scala code is not at all Scala-like. Part of this is because Peter picked a problem that is best solved in a very imperative way. It would be interesting to pick a problem that also has a nice functional solution and then play with that a bit. This would also make it possible to see how using parallel collections could be beneficial.
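For a taste of what a more functional version might look like, here is a rough sketch of the same counting logic using foldLeft over an immutable Vector instead of mutating an array. This is purely illustrative and untimed; the smaller runs value is just to keep it quick, and I make no claims about its performance relative to the loops above:

```scala
val runs = 1000

// Thread an immutable Vector of 144 counters through the fold,
// producing a new Vector with one count bumped at each step.
val counters = (0 until runs).foldLeft(Vector.fill(144)(0)) { (cs, i) =>
  val times = (i % 12) * (i / 12 % 12)
  cs.updated(times, cs(times) + 1)
}

// Every iteration increments exactly one counter.
assert(counters.sum == runs)
```

Note that a fold like this is inherently sequential, so exploring parallel collections would require restating the problem in terms of map and reduce style operations instead.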

A more significant question was posed in a discussion on Google+ by one of my former students, Will Shepherd. This is the question of what happens with memory if the variables are a bit more complex. Here again, the chosen example code holds us back a bit, as there doesn't seem to be much need for non-primitive variables here. However, the results could be extremely different if the variables were to hold an array or some type of mutable object.

Perhaps someone can suggest a problem that would make some sense in exploring these two avenues.

Conclusion
My take-away message from Peter's original blog post, which I think holds true here as well, is that your best bet is to write good code first, then worry about performance as needed. Declaring variables locally is just a good idea. Limiting scope is remarkably helpful in preventing bugs. So that is how you should write your code. If you find that your code isn't fast enough once you have it working, then you can explore options for making it faster. In the case of local declarations of primitive variables in loops, I think it is clear that there isn't any point in even trying that route. However, if you do decide to start exploring optimizations, you need to be careful and meticulous. Re-benchmark after each "optimization" because it is possible that you might make a change that leads to uglier, harder-to-maintain code without getting any speed benefit at all.