Building a Compiler
Part 3 – Introduction
This series of articles is intended to be a kind of documentation of my design and implementation of a new programming language. Hopefully, writing out my thoughts allows me to improve on them easier. Additionally, quite of few of the topics within programming language implementation are very hard to either learn about or to grasp. For example, the process of writing your own executable on Windows without using your own linker requires digging up articles from the 90's (Granted, this isn't something most people need to do, but it'd be nice to have this information readily available).
Programming Languages vs. Implementations
First off, a few crucial distinction: A programming language is fundamentally distinct from an implementation of a programming language. A programming language is nothing more than a specification of the behavior that a given text will have. Technically, the program doesn't have to be text. The program could be something like bits or a visual program. The programming language just has to describe what that representation means in terms of behavior. An implementation, on the other hand, is the specific program that actually executes said behavior. There may be multiple different implementations of a particular programming language, but (ideally) they will all have the same end result for the same program.
A few languages confuse this difference. Python, for example, has one major implementation in CPython, which defines Python better than its spec does. Other Python implementations, such as PyPy or IronPython, struggle because they cannot leverage the C Libraries like numpy which have been coded to rely on CPython implementation details.
On the other hand, C does very well in separating its specification from its implementations. A new C specification comes out every once in a while (ANSI C, C99, C11, etc.), and every C implementation does its best to meet a particular version of C's specification. GCC and Clang typically aim to implement everything in the latest version, while MSVC was stuck on ANSI C for a very long time.
The Compiler
The options for implementing a programming language boil down to either making a custom chip which implements a programming language, or translating code from that language down to machine code so that it can run on a general chip. Given that the first option is very difficult and inefficient, virtually everyone chooses to go for the second option, either directly or indirectly. Most C implementations take code written in C and translate that to your processor's machine code. Meanwhile, CPython translates Python to C, and then translates C to machine code. The actual order this happens in is the other way around. We'll get to interpreters shortly.
That translation step is what defines a compiler. A compiler, in its most general sense, is a program which translates code from one programming language to another, and is therefore the most common way of implementing programming languages. Anyone who took the second path I mentioned above wrote a compiler. However, even though this definition of a compiler is nice and short, it isn't particularly useful. A compiler which translates C to Machine Code is going to look very different from a compiler that translates Python to C. And even within translating Python to C, there are two very distinct ways to do so in interpreters and transpilers. Therefore, compilers are divided into 4 distinct types.
Types of Compilers
A traditional compiler specifically refers to a program that takes a program in a particular programming language and outputs a program in machine code. Thus, programs such as gcc, clang, or rustc are what are called "compilers". Compilers have a well-deserved reputation for producing performant results, since it isn't possible to be faster than machine code, and since all the work of translation is done before the resulting program is actually run.
A transpiler is a more general version of the aforementioned type of compiler. Instead of outputting machine code, it instead outputs code in another (relatively) high-level language. There aren't very many examples of transpilers, since the input and output languages have to be very close in semantics in order to get any benefit in writing one. Naturally, any Turing-complete language can be implemented in any other Turing-complete language. However, if the semantics of the two languages are too different, then we can't translate equivalent concepts in one language to another. For example, in a C to Python transpiler, how do you implement pointer manipulation? Your only option would be to simulate an entire computer from the bit-level up in order to properly capture C's semantics, which is obviously not useful. Early C++ implementations were transpilers which outputted C code (the addition of exceptions prompted C++ implementations to become traditional compilers). More recently, Typescript is implemented as a transpiled language to JavaScript. Both early C++ and TypeScript are very close in semantics to their host languages of C and JavaScript, respectively, so transpilers could be written.
Interpreters take code in one language, and execute it on the fly within another language. In a sense, they are similar to transpilers in that they use an intermediate language between the source language and the machine code the computer actually executes. The primary (and very important) difference lies in the fact that the translating between the source and the intermediate language is done at run-time instead of beforehand like a transpiler does. This has many benefits, the foremost being that more information is available at run-time than compile-time, so normally difficult translations become trivial. For example, in CPython, a Python interpreter which uses C as the intermediate, the duck typing that Python uses would be normally impossible to translate beforehand to C. However, by doing the translation at run-time, it becomes very easy through use of associative arrays (also known as hashmaps or dictionaries).
The final type of compiler, a Just in Time compiler (JIT) is an intermediate between traditional compilers and interpreters. The bulk of translation happens at run-time like in an interpreter, but the translation is to machine code like in a compiler. A well written JIT can therefore leverage the speed of a traditional compiler, while also having the flexibility of semantics that an interpreted language enjoys. A poorly written JIT, however, will be very slow due to the long translation times from the source language to machine code.
Why I'll use a Traditional Compilers
The different types of compilers fit neatly on a scale of the difficulty to code. Interpreters are fairly easy, and it is therefore easy to find tutorials on how to make one. Transpilers and traditional compilers are quite a bit more difficult than interpreters to create, and are therefore slightly harder to find information about. JIT's are very difficult to get working properly, and so are even harder to learn about. I'll be making a traditional compiler, since they offer what I believe is the ideal trade-off between implementation complexity and language usability.
This series aims to be a thorough exploration of how to code a traditional compiler (referred to hereafter just as a compiler). Additionally, I'll use no tools beyond the most common functions of the standard library. This means that no parsing libraries will be used, nor any assembly-generation tools like LLVM. I'll be writing the compiler in C, and developing a marginally better version of C. Finally, the name of the language I'll be implementing is 'Swerve'.