Thursday, May 15, 2014

Strong Typing vs. Weak Typing vs. Static Typing vs. Dynamic Typing

This post is basically a little rant of mine related to misuse of terms related to typing. In particular, I get frustrated when people use the term "strongly typed" when what they really means is "statically typed". There are also many situations where people will say that something is "weakly typed" when what they really mean is "dynamically typed". The key to understanding these things is that there are two different orthogonal issues here. The first is strong vs. weak typing and the second is static vs. dynamic type checking. Programming languages can exist in the space defined by these two independent axes.

Strong vs. Weak Typing

A strongly typed language is one where every statement that is executed must follow the proper rules of the type system. A weakly typed language is one where this isn't the case. It turns out that most all of the languages people are working with these days are strongly typed. People often don't think of languages like Python as strongly typed, but just try doing something that isn't allowed with a value in Python. The program will run, but only up until that point. Then it will crash. This same thing is true for pretty much all scripting languages. It is also true for most compiled languages.

Now, different languages might be more liberal with what is "allowed" by the type system. Languages like Perl and JavaScript take great liberties in converting things so that they "work". They might implicitly convert strings to numbers and back as needed for different expressions. However, those are the operations that are defined by the language for those types. One can come up with many reasons why doing these things is bad in large software development projects, but that is why they are scripting languages. Being liberal with type conversions doesn't really mean the same thing as saying that the language has weak typing.

So then what languages do actually have weak typing? What languages allow you to do things that are completely outside of the type system? The simplest answer to this is C/C++. To illustrate this, consider the following code, which compiles without complaint under g++ version 4.7.3 using the -Wall and -pedantic options.
int main() {
    double a = 5.0;
    double *d = &a;
    cout << a << endl;
    int *b = (int*)d;
    cout << b[0] << endl;
    cout << b[1] << endl;
    b[0] = 99;
    b[1] = 98;
    cout << b[0] << endl;
    cout << b[1] << endl;
    cout << a << endl;

    return 0;
}
Granted, this is something that would be considered very poor C++ code because you should never do C-style casts in C++. However, that type cast on line 5 is perfectly valid in both C and C++, even though it breaks the type system. Indeed, any code that you see in the C standard libraries that takes a void* works by virtue of the fact that the type system is weak and can be broken. In case you are wondering, the output for this code on my machine happens to be the following.
5
0
1075052544
99
98
2.07956e-312
This is allowed because there are times in system level programming when you have real pointers into memory and you really just want to force the type that you are dealing with at that address. Such activities are dangerous and generally not cross-platform compatible, but when you are doing system level programming you are programming for a particular system and don't really expect it to work on others.

In a certain sense, languages are either strongly typed or they are weakly typed. Either the language allows you to break the type system with something like the code shown above, or it doesn't. However, even languages that let you break the type system can discourage it. I'm sure that many C++ programmers would be unhappy with the characterization that C++ is weakly typed. The above code is proof that it is true, but a lot of effort has gone into making it so that the proper style of code in the newer releases of C++ is strongly typed. Well written C++ code will almost certainly be done using a subset of the language that is strongly typed. They just happen to have this hole in the type system for reasons of backward compatibility and for those rare situations when system level programming really does require that you treat a block of memory as something that it isn't. I should also note that doing this with floating point numbers, as shown in my example, would be much less common than doing it with integer types of different sizes.

Static vs. Dynamic Type Checking

The more interesting aspects of programming languages have nothing to do with strong vs. weak typing. They instead center on the way in which a language does type checking. The static vs. dynamic type checking distinction focuses on whether types are checked at compile time or during the runtime. A purely statically typed language would make certain that every type was correct at compile time and no checks would ever be done at runtime. A purely dynamic language doesn't check anything related to types a compile time and all those checks are done at runtime. (I use the term "compile time" loosely here because dynamically typed languages can be interpreted and might never go through a formal compilation step.)

Most of the arguments that people get into regarding type systems focus on this static vs. dynamic distinction and the way it is implemented. Proponents of statically type checked languages argue that they provide a certain level of safety in that the compiler does a formal proof that the types in the program are correct. Proponents of dynamically typed languages argue that the type systems associated with static type checking are overly restrictive and add a burden to the programming process. While I sit strongly on the side of static type checking, the purpose of this post is to draw distinctions in terminology, not to get involved in that debate, which has been hashed out in many places.

Where Languages Stand

In some ways, where a language fits into this depends a fair bit on how you program it. That isn't true for every language, though. Some languages have a very specific location in the plane of these two axes. Even for the languages that don't, we can probably say some general things about the standard styles in them. If you follow normal style rules, few languages are going to be weakly typed in practice. C is probably the weakest, simply because the primary method of creating polymorphism in C involves casting to void* and losing type information. If you do this improperly, you won't find out about it until runtime and the result could be a runtime error where you program terminates or it could be a logic error where the program simply outputs garbage. In general, languages you use today other than C/C++ are probably strongly typed. Any violation of the type system causes an error either at compile time or runtime.

The static vs. dynamic axis is much more interesting. Scripting languages and functional languages derived from LISP often sit completely on the dynamic type checking extreme. If your code contains an expression like sqrt("hi"), things will run just fine until that line is actually executed. This is very typical for languages that use traditional, dynamic duck-typing.

There are also some languages that fall at the other extreme of the spectrum where absolutely everything has passed through type checks at compile time and nothing remains to be done at runtime. Many of the languages in the ML family, including SML and Haskell fall here. Those languages typically have type inference where the types of values are inferred by their usage and those inferred types are checked for correctness. Older procedural languages, such as Pascal, also fell at this extreme. The lack of OO and inheritance hierarchies made the type systems of these languages very straightforward.

Most of the languages that people think of as being statically typed actually include some dynamic typing as well. This is because they include things that simply can't be checked at compile time. In OO languages, you run into this when you have to downcast to a more specific type. As a general rule, this is considered a bad thing to do. In C++ the dynamic_cast is there because they need it, but it is intended to look ugly. In Java, unfortunately, the early versions of the language, before generics were introduced in 1.5, required a lot of downcasting. The collections were collections of type Object and every time you took something out of a collection you needed to cast it to a more specific type. This is ugly and error prone. Hence, the addition of generics. My own favorite language, Scala, includes an asInstanceOf method in the Any type, but use of it is strongly discouraged. (Just search for it on StackOverflow and you'll see how much it is discouraged.) The approach that is recommended in Scala is to use a match expression. This is still a runtime type check, so even well written Scala includes features that involve dynamic type checking. This is a feature of OO languages because of their type hierarchies and the occasional need to downcast. The ML family languages avoid that by simply not having type hierarchies.

Summary

So that wraps up my discussion of different terms for type systems. Next time you want to call something "strongly typed", think a bit and consider if what you really mean isn't "statically typed". A lot of the time, that is what people really mean.