Compilers are required tools for software development. Essentially, both tools translate your written code into something that a machine can use.
Most of the details of this episode come from the book “Compilers: Principles, Techniques, & Tools” by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. It’s an excellent book for deeply understanding how compilers actually work under the hood. You may have heard this book refered to as the “Dragon” book because of it’s over-the-top awesome cover.
There is a lot going on in the process of converting your high-level code into results that come out of your machine. While you may not have to worry about this stuff on a day to day basis, it’s important to understand what is actually happening to convert your code into something the machine can use. In many cases, understanding what is going on under the hood can help you troubleshoot certain kinds of performance problems and other weird issues in your code. Additionally, the understanding of how compilers work can also be informative when dealing with certain kinds of problems in higher level code as well.
A program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
A language with a strong degree of abstraction from the details of how the computer operates. An example of this might be C++, C#, etc.
A language that provides little or no abstraction from the computer’s instruction set architecture. An example of this might be assembler.
Instruction set architecture
An abstract model of a computer. This defines everything that a machine language programmer needs to know to program the computer. These define things like CPU registers, main memory, how memory is addressed, and the like. In some future episode of the podcast, we’ll talk more about how.
Machine language is built up from discrete statements or instructions. An instruction may specify the CPU registers in use, an operation occuring on the contents of the register, and a memory location. These may include logical operations, arithmetic operations, control flow operations, and the like.
A register is a quickly accessible location in the CPU of a computer. They are usually a very small amount of extremely fast storage. Most systems load data from memory locations into registers, do operations on the contents of the register, and then move the data back into memory. This is why it’s faster to XOR a register’s value with itself than it is to move a zero into it in many older processors. Registers are usually referred to by their size, which is expressed in bits. Registers are usually named. Modern x86 architecture will have several duplicates of a lot of their registers for the purposes of performance. There are few registers and they are small.
Memory addresses are references to a specific location in the computer’s memory. They are usually unsigned integers of some size or other.
The stack is a data structure that stores information about the active subroutines of a computer program. Essentially, when you call another function, relevant variables from your current function are pushed onto the stack. This includes a pointer to the location in memory of the currently executing instruction. Next, execution proceeds to the memory location when the function you called begins. Once it is done, the previous address is popped off the stack, execution goes back there and the relevant variables from the previous call are popped off the stack. This is substantially more complicated than it sounds.
Compilers can either be single pass or multipass. Single pass compilers are usually faster. Many early ones are single pass. Multipass compilers go through the process several times and are advantageous for more advanced optimizations of the compiled code.
19:30 Why a compiler?
The explanations above are hugely simplified from what really goes on. Programming in assembler or other truly low level languages is much harder to reason about for most people. The best way to describe it is that you tell the computer what to do on the hardware and the “side effects” of that produce the result for you.
The hardware that your code runs on will vary a lot, especially at the lower levels. You really want a layer of abstraction around it. While you can more heavily optimize lower level code, it can take a ton of time. Many optimizations are more easily done by a tool, instead of a person. Processors can do multiple things at the same time. This is hard for humans to reason about effectively at a low level.
You probably also want to write code that more accurately represents your mental model of a problem, rather than the way a computer would solve it.
23:45 The Compiler Front End
The front end of the compiler is responsible for analyzing the source code to build an internal representation of the program. This includes building a symbol table which maps parts of the source code to information associated with them.
This is usually broken into three phases. Lexing, which breaks a sequence of characters into a sequence of tokens, which are strings with an assigned and identified meaning. Syntax analysis, which converts the sequence of symbols/tokens and turns it into a representation of the code as an expression (usually in a tree form, called a parse tree). Semantic analysis, which adds semantic information to the parse tree and builds a symbol table. This also does checks such as type checks, object binding, and lots of other things, depending on the language in use. This is usually when incorrect programs are detected. The Front End may be preceeded with a preprocessing phase that handles things like macro substituition, conditional compilation constants, etc.
27:05 The Middle End
The middle end is responsible for optimizing the intermediate representation produced at the front end. This usually has a couple of steps. analysis. A number of different approaches may be applied here, but the goal is to find operations whose complexity can be reduced. Optimization. This is the phase when the optimizations are applied to the parse tree.
Optimizations may include a number of things, such as the following: Inline expansion, which replaces a function call with the contents of the function. This saves a jump,stack allocation, etc. Dead code removal – does just what it says by getting rid of hueristically unreachable code. Constant propagation – If an expression is composed entirely of constants, then replace it with the result. Loop Transformation – There’s a lot of stuff that can happen here, which includes optimizations to break a loop into multiple loops, swapping inner and outer loops, etc. These are done for performance. Most program code happens here. Automatic Parallelization – converts sequential code into parallel code so that multiple parts can execute at the same time (where possible). This is tricky if there is any shared state.
35:15 The Back End
The back end is responsible for CPU architecture-specific optimization and for actual code generation. There are two main phases here. Machine dependent optimizations, such as peephole optimizations, which rewrite short sequences of instructions into more efficient instructions. Code generation, which is the point where the intermediate version produced by the preceeding steps is converted into the output language.
Code Generation is a little more complicated as well. A lot of decisions about storage and scheduling are handled during this phase. This may involve rearranging the order of instructions in a program to something that is computationally equivalent but that has less risk of pipeline stalls. A pipeline stall occurs when the execution of an instruction is delayed to avoid a hazard. Hazards occur when the next instruction can’t execute in the following clock cycle or will execute with a bad result. Essentially, these hazards can result either when timing issues with the CPU aren’t considered, or when timing issues around when data should be read or written are not considered. Think about deadlocks and race conditions in higher-level code and it’s pretty similar. This is also when memory addressing modes are considered.
Code Generation also generates other useful artifacts. The information that your debugger uses may also be generated during this part of the process.
eSight Eye Wear
One step closer to Geordi’s visor in Star Trek, eSight is a way to help the visually impaired see. It works by capturing high-quality video and displaying it on two high-resolution screens. One in front of each eye. This footage is also enhanced by custom optics and special algorithms designed specifically for the wearer. It allows people to live normal lives, enjoy hobbies, and see their loved ones again. While rather expensive, they do have a payment plan.
Tricks of the Trade
Learn the low level stuff. If you understand how the guardrails that protect you are built, you have a better chance of not having a terrible accident by assuming they do something for you that they don’t. Additionally, you can learn when to get around them. But above all, learn how the machine really works – it won’t help you often, but when it does, it helps you a lot.