Saturday, August 31, 2013

import != #include

I spent my summer doing software development for a local company and most of what I did was structural work on a large C++ code base with a long history. Since I am teaching a data structures course this coming fall that will use C++, I figured it would be good experience. It certainly gives me some stories to tell, but it has also helped to bring into sharper relief the differences between C++ and Java (as well as all the languages that have been influenced by Java).

I will write another post giving more of my thoughts on C++11, but I should mention right off the bat here that I think C++11 has a lot of really cool features that greatly improve the language. Developing modern code in C++ is not a bad thing, but the legacy of C++ means that not all code written today has to use the modern style, and worse, the code currently in existence doesn't use these features at all. Plus, even with the improvements C++ still uses pretty much the same tool chain as C and that is a bit of a problem.

How They Differ

I've always known that import and #include do different things. I make sure to point this out to students any time I am teaching a class where I can compare languages that use these two different features. However, working on a large C++ code base made the difference really stick out. What I came to realize is that #include causes a problem because it impacts the structure of code. This isn't an issue with import because it does nothing more than tell the compiler how to resolve short names in a source file into fully specified names.

The difference becomes more obvious when you run into situations where the order of #includes in a source file is important. I have to point out that if one follows all the normal best practices this never happens. Unfortunately, not everyone has followed these practices. In particular, there are Windows libraries that break the rules and cause the order of includes to matter. The Windows libraries also have odd behaviors where you aren't allowed to include some files directly because they depend on other things being defined first. This can be particularly challenging when you are working on a project and the IDE tools do a great job of telling you exactly which header file defines something you need, but you aren't allowed to include that file because Microsoft did something non-standard in building their libraries. (This comes up a lot with their typedef of BOOL and their macros for TRUE and FALSE. That topic is probably worth a whole blog post to rant about.)

In some ways, I feel that the real problem is that header files can, and often must, include other header files. Because of this, putting a single #include in a source file can result in 100 or more other headers being included. Mix in a few #ifdef or #ifndef directives and things quickly become a complete mess where order matters a lot.
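As a minimal sketch of how order starts to matter, the single file below pastes in the bodies of two made-up headers in the order a source file might happen to include them. The names text_buffer.h, project_config.h, USE_BIG_BUFFER, and kBufferSize are all hypothetical and exist only for illustration.

```cpp
// Sketch only: two hypothetical headers, shown inline in the order a
// source file happened to #include them.

// ---- contents of hypothetical "text_buffer.h" (included first) ----
#ifdef USE_BIG_BUFFER
constexpr int kBufferSize = 4096;
#else
constexpr int kBufferSize = 64;   // chosen because the macro is not defined yet
#endif

// ---- contents of hypothetical "project_config.h" (included second) ----
#define USE_BIG_BUFFER

// By this point the macro exists, but the decision above was already made.
int buffer_size() { return kBufferSize; }
```

Swapping the two blocks would change kBufferSize to 4096 without touching any other line, which is the kind of silent dependence on order that a few #ifdef directives can create.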

How This Happens

Now it is easy to throw stones at previous developers (including those for Windows) and say they just didn't know what they were doing. Inevitably there are situations where previous developers made some poor decisions that led to structural code problems in the headers. However, many of these things can creep into code over time and maintenance programmers can easily add them in not realizing what they are doing. The reason is that some flaws in code related to #includes and headers are hard to track down unless you have a powerful static analysis tool to help you. For example, files should #include all the things that they use and not things that they don't. Sounds like a simple enough rule to follow when you are the original author of a file. The compiler won't like your code if you don't #include things you are using, and unless you just have a bad habit of adding lots of #includes at the top of every file because you "might use them" you aren't going to put in extra stuff.

However, even the original author can run into trouble by relying on the compiler to confirm that everything needed has been included. If you have one header file that includes a lot of others, it is possible you might include that one file and forget to include the others directly even though you use things in them. This doesn't sound like a problem until you, or someone else, changes what that one header file includes and your source files break because they weren't doing their own includes directly. Relying on one file to do things for you that aren't really part of its job description is generally a great way to give yourself headaches later on.
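A small sketch of the safer habit, using only standard headers: the file names every header whose contents it uses directly. The function total_length is a hypothetical helper. Whether some other header would have pulled in <string> or <vector> for you depends on the standard library implementation, so that is not something to count on.

```cpp
// Sketch: include every header whose names this file uses directly,
// instead of hoping another header drags them in.
#include <cstddef>  // std::size_t
#include <string>   // std::string
#include <vector>   // std::vector

// Hypothetical helper: adds up the lengths of all the words.
std::size_t total_length(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& word : words) {
        total += word.size();
    }
    return total;
}
```

With the includes spelled out like this, a later change to what some unrelated header includes cannot break this file.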

When you consider the situation of the maintenance programmer, things get much worse, especially if the code was a bit smelly to start with. It is easy to add code and just say that if it compiles everything is happy. It takes time and effort to go to the top of the file and see if there is already a #include for the things you just added in. The time and effort grow if the file is longer than it should be. Not only do you have to jump farther from your current editing point to check, but the length of the #include list generally grows as well.

The problem is even worse when you are deleting a function, method, or even a few lines of code. Figuring out if you just deleted the only reference to something in a particular #include is not a trivial task. As a result, #includes, once added, are unlikely to ever go away until someone decides to spend some real time cleaning up, or until you have a static analysis tool powerful enough to tell you that a particular #include is no longer needed.

It Made Sense for C

So why was such an odd system put into place to begin with? Well, it made sense for C, which ran on limited systems and very strictly followed the requirement of everything happening in a single pass. There were lots of hardware reasons why the original C compilers needed to go through their programs in a single pass from top to bottom and not store lots of extra tables for things along the way. When your program is stored on a tape (or punch cards) and your machine has limited memory, you don't want to have to run through the source multiple times in the process of compiling.
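The top-to-bottom legacy is still visible in C++ today. As a small sketch (the function names are only for illustration), two functions that call each other need a forward declaration, because at namespace scope a name has to be declared before the first line that uses it.

```cpp
// Sketch: a C++ translation unit is still read from top to bottom, so at
// namespace scope a name must be declared before it is used.

// Forward declaration: without this line, is_even below would not compile,
// because is_odd has not been seen yet.
bool is_odd(unsigned n);

bool is_even(unsigned n) {
    return n == 0 ? true : is_odd(n - 1);
}

bool is_odd(unsigned n) {
    return n == 0 ? false : is_even(n - 1);
}
```

A header file is essentially a collection of such declarations, pasted in ahead of the code that needs them.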

What changed?

Of course, most of those reasons are completely moot on a modern machine. This is why we have seen a march of programming languages that move more and more of the work onto the compiler. Focusing on Java, the import statement doesn't actually do anything to the code. It just tells the compiler how to resolve short names into the longer, fully specified names that they represent. (Honestly, import is really the equivalent of using in C++, not #include.)
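To make the comparison with using concrete, here is a minimal sketch. The namespace edu::example::shapes and the Circle type are hypothetical stand-ins for a Java-style package and class.

```cpp
// Hypothetical nested namespaces standing in for a Java-style package.
namespace edu { namespace example { namespace shapes {
    struct Circle {
        double radius;
    };
}}}

// Roughly what a Java import does: let the short name stand for the full one.
using edu::example::shapes::Circle;

double doubled_radius() {
    Circle short_name{1.5};                        // short name, thanks to using
    edu::example::shapes::Circle full_name{1.5};   // fully specified name
    return short_name.radius + full_name.radius;
}
```

The using line adds no code to the file; it only changes how the name Circle is looked up. In a multi-file C++ project the declaration of Circle would still have to be made visible with a #include, which is the part Java does not need.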

Faster machines with more memory, and the fact that you are no longer compiling something stored on a tape, make multiple passes and random access far more acceptable. So you don't have any need to have the preprocessor spit out something that can be handled in one pass from top to bottom. You don't mind if the compiler has to go looking in separate files. In fact, that can be faster. The whole idea of precompiled headers in C and C++ only exists because opening and processing the same header files over and over for every source file in a large project can really slow things down. Losing the concept of a header file removes that overhead from the compiler. (I also really appreciate that it removes code duplication. Having to touch two files any time a function signature is adjusted has always annoyed me.)
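The duplication mentioned in the parenthetical looks something like the sketch below, where the contents of two hypothetical files, area.h and area.cpp, are shown together. The function rectangle_area is made up for illustration.

```cpp
// ---- hypothetical "area.h": the declaration other files would #include ----
#ifndef AREA_H
#define AREA_H
double rectangle_area(double width, double height);
#endif

// ---- hypothetical "area.cpp": the definition, repeating the signature ----
double rectangle_area(double width, double height) {
    return width * height;
}
```

Changing the parameter list means editing both places. If only one of them is updated, the mismatch tends to show up later, often as a linker error rather than a clear message at the line that was changed.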

Making import work

In Java, a significant piece of what made this workable on the computers available in the mid 90s was a set of very specific rules for file names and directories. The fully specified name in Java tells the compiler exactly where it should go looking for something. So all the compiler has to do is figure out the fully specified name, and that is exactly what import statements do in the code: they allow short names to be turned into fully specified names.
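The naming rule can be sketched in a few lines. This is an illustration of the convention, written in C++ to match the other examples here, and not a description of how javac is implemented. The function class_file_path is hypothetical, and it ignores details such as nested classes, jar files, and the list of root directories that is actually searched.

```cpp
#include <algorithm>
#include <string>

// Sketch of the naming convention only: turn a fully specified class name
// into the relative path where the compiled class is expected to live
// under some root directory.
std::string class_file_path(std::string qualified_name) {
    std::replace(qualified_name.begin(), qualified_name.end(), '.', '/');
    return qualified_name + ".class";
}
```

For example, class_file_path("java.util.ArrayList") yields "java/util/ArrayList.class", so once a short name has been resolved to its full name there is nothing left to search for.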

Newer languages have relaxed some of Java's strict naming rules and have put even more burden on the compiler. Scala comes to mind as a key example of this. It compiles to JVM bytecode and the .class files are placed in the directory structure required by the JVM, but the compiler takes its cues from the package declarations in the source code and not from where the source files sit on disk. Of course, it is generally recommended that you follow the Java scheme because it makes it easier for programmers to find where things are as well.


My main conclusion from all of this was that the decision to leave behind #include semantics and switch to import semantics was a huge step forward in programming. I know that I first saw import in the Java language, but I expect that it dates back to something earlier. Perhaps someone can leave a comment on the origin of import semantics.

Tuesday, August 27, 2013

Why does Klout undervalue Google+?

Let me start off by saying that I'm a fan of Klout. I think that it is a very interesting metric of social network activity. I like the idea of an attention based economy and I think that Klout is a potential model for how that could begin. Plus there is the Scala factor. Given that I am a Scala zealot, I love that Klout not only uses Scala but also writes about it in their engineering blog. I think they give great publicity to the language. I am also very much looking forward to having Klout include other networks in their scores. Specifically, I would love to see the inclusion of YouTube and Blogger.

Having said all of that, there is one thing that has been really bothering me. I feel that Klout dramatically undervalues Google+. To illustrate this we can look at my profile and the network breakdown on Klout.

You can see that Klout says that most of my score comes from Facebook followed by Twitter and then Google+ right behind that. They also provide some summary information for what they base this on including friends, followers, and interactions on the different networks.

One of the things that Klout isn't showing under the Network information is how many people have me in circles on Google+. I find this very odd given that they do list the number of friends for Facebook and the number of followers on Twitter. So you can see in the figure above that I have fewer than 200 Twitter followers and under 600 Facebook friends. To compare this to Google+ I've included my CircleCount page here.

You can see from this that I have over 8,500 followers on Google+. I'll be the first to admit that I don't have a lot of engagement with the vast majority of those followers, but I typically get 10+ notifications of some type of engagement every day. I probably get a bit more engagement on Facebook, but I have much, much less on Twitter.

So my question is, why does Klout say that Twitter is slightly more significant than Google+ and that Facebook is more than 6x as significant? Is there something about the API that Google provides that prevents them from getting the needed data to consider Google+ appropriately or is their formula simply messed up? I'm hoping someone working at Klout might see this and comment. Maybe even look into it and consider tweaking that part of their formula in the next major revision. I realize it is very hard to compare across networks and that probably leads to differences, but this seems a bit extreme to me. I also feel for the Klout developers as they look at how to integrate Blogger, YouTube, and others.