Code Generation Antipatterns
Code Generation is an important part of most software projects today, whether the people involved know it or not. Lots of code is built on the fly for you with most tools, and for good reason. It’s often easier to handle a lot of roughly similar code by writing code that writes the code. Tool vendors do this all the time with a lot of success.
However, we’ve all had the experience of working on a project where Some Dude came in and slung together a code generation framework in an afternoon. Far from creating a cleaner codebase that’s easier to work with and better for the team, most attempts at code generation in the wild end up wasting money and time. Not only is the code often hard to extend, but it is often accompanied by a raft of sloppy hacks from other developers trying to work around its shortcomings. The problem isn’t code generation itself. It’s that the people writing the code generators don’t have a firm grip on how to avoid major problems, and they fail to account for occasions where other people might need different behavior from the framework.
Most code generation in the wild is done in a sloppy manner that causes problems later. However, if you use reasonable patterns and design your code generator well, you can avoid a lot of these problems. Above all, you should treat generated code as a first-class citizen, with all that implies. This code may make up a substantial and critical part of your system, so it’s important that it be treated with at least as much care as similar code that isn’t generated. Code generation by itself doesn’t produce better code; it simply makes more of it.
Episode Breakdown
Generating all system layers in one step
Use each code generation step to generate only one layer. This keeps the scope of changes limited and keeps concerns from being tangled. The single responsibility principle applies to code generators too.
Layers that use the current layer should be generated in a later step, from metadata describing the current layer. This helps force developers to keep the generated layers from becoming entangled. It can also make things easier if multiple variants of the caller are needed.
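As a rough illustration of the one-layer-per-step idea, here is a minimal Python sketch. Everything in it is hypothetical (the `ENTITIES` list, the two templates, and both function names); the point is only that the second step reads the metadata the first step returns, not the first step's internals.

```python
from string import Template

# Hypothetical entity list; in practice this would come from your own metadata.
ENTITIES = [{"name": "Customer", "fields": ["id", "email"]}]

REPOSITORY_TEMPLATE = Template(
    "class ${class_name}:\n"
    "    FIELDS = ${fields}\n"
)

SERVICE_TEMPLATE = Template(
    "class ${class_name}:\n"
    "    def __init__(self, repository):\n"
    "        # Expects an instance of ${repository}.\n"
    "        self.repository = repository\n"
)


def generate_data_layer(entities):
    """Step 1: emit one repository per entity, plus metadata describing it."""
    sources, metadata = {}, []
    for entity in entities:
        class_name = f"{entity['name']}Repository"
        sources[f"{class_name}.py"] = REPOSITORY_TEMPLATE.substitute(
            class_name=class_name, fields=repr(entity["fields"])
        )
        metadata.append({"entity": entity["name"], "repository": class_name})
    return sources, metadata


def generate_service_layer(data_layer_metadata):
    """Step 2: runs afterward and sees only the metadata from step 1."""
    sources = {}
    for item in data_layer_metadata:
        class_name = f"{item['entity']}Service"
        sources[f"{class_name}.py"] = SERVICE_TEMPLATE.substitute(
            class_name=class_name, repository=item["repository"]
        )
    return sources


data_sources, data_metadata = generate_data_layer(ENTITIES)
service_sources = generate_service_layer(data_metadata)
```

If you later need a second variant of the caller, it can be one more function that reads the same metadata.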
Blackboxing
You need to be able to see what is going on in the generated code. This includes the ability to step through with a debugger. This means that generated code assets should be raw source, not binary. This also means that obfuscation and minification should be done later.
You need to make sure output, especially error output, is straightforward. This means that your framework should never emit a NullReferenceException – you should generate something meaningful to tell the consumer what is null. You should also consider having retry logic in the code where appropriate. You might also wish to emit events when certain conditions occur. Then consumers can listen for those events and take further action. Above all, make sure that someone without documentation can quickly reverse-engineer your generated code.
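To make the "tell the consumer what is null" point concrete, here is a small Python sketch of what generated code might do. The class names and the message format are assumptions made for illustration. In Python the nearest thing to a bare NullReferenceException is an `AttributeError` on `None`, so the generated method checks first and says which member was missing.

```python
class GeneratedCodeError(Exception):
    """Hypothetical base class for errors raised by generated code."""


class MissingValueError(GeneratedCodeError):
    """Raised instead of letting a bare AttributeError on None escape."""

    def __init__(self, component, member):
        super().__init__(
            f"{component}: '{member}' is None; set it before calling this method"
        )
        self.component = component
        self.member = member


class CustomerRepository:
    """Stand-in for a generated class (illustrative only)."""

    def __init__(self, connection=None):
        self.connection = connection

    def find(self, customer_id):
        # Generated guard: name the exact thing that was missing.
        if self.connection is None:
            raise MissingValueError("CustomerRepository.find", "connection")
        return self.connection.fetch("customer", customer_id)
```

A caller who forgets to pass a connection gets a message naming `CustomerRepository.find` and `connection` instead of a stack trace pointing somewhere inside generated code.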
Lack of logging or poor logging practices
Consumers of your generated code also need to be able to see what happened in a runtime environment. This means robust logging needs to be included, along with the ability to redirect logs to different formats. You shouldn’t try the “Logger-As-Glorified-TextStream” antipattern here. Your code should not have an opinion about what is important, only an opinion about the logging level. You may also want to number your errors, so that users can quickly find problems in the generator by error number.
Logging should include enough information to assist in debugging and modifying your code generator as well. Remember that the generator is also code and subject to the same kinds of problems as regular code. You need meaningful error output when a problem occurs, even if it is doing nothing more than writing out to a console somewhere. Consider directly dumping errors as raw lines to your output files as well. If you are using a compiler, this will make the problems easy to find.
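Here is one way the numbered-error idea could look, sketched in Python with the standard `logging` module. The logger name `codegen`, the `CG` prefix, and the error catalog are all made up for the example. The generator only picks a level and an error number; where the log goes and how it is formatted is left to whatever handlers the consumer attaches.

```python
import logging

# Hypothetical logger name; consumers attach their own handlers and formatters.
logger = logging.getLogger("codegen")

# Hypothetical numbered error catalog, so a user can search by error number.
ERROR_CATALOG = {
    1001: "Template file not found",
    1002: "Entity has no fields",
}


def log_error(code, **context):
    """Log a numbered error at ERROR level and return the message text."""
    details = ", ".join(f"{key}={value!r}" for key, value in sorted(context.items()))
    message = f"CG{code}: {ERROR_CATALOG[code]} ({details})"
    logger.error(message)
    return message
```

Calling `log_error(1002, entity="Customer")` produces a line that starts with `CG1002`, which is easy to grep for in either a log file or the generator source.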
No ability to override behavior
Code generation is a good way to handle roughly 80-90 percent of roughly similar code. It goes sideways if you try to make it handle 100 percent of cases. Let the consumer implement the weird edge cases. Consider making an “ignore list” of components that you won’t build, so that users can implement their own safely without the framework overwriting their changes. You want your code generator to act more like a library than a framework.
This is implemented differently in different languages. In many languages, you need only expose events that are consumed to modify the way the application works. In other languages, you might use class inheritance, or partial classes to handle this. You might also extract certain functionality out into its own class and use dependency injection to put it in without having to touch framework code.
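A minimal Python sketch of both ideas, the ignore list and a per-component override, might look like the following. The function names and the `ignore` and `overrides` parameters are hypothetical; a real generator would likely read them from configuration.

```python
def default_render(name):
    """Default output for a component (illustrative placeholder body)."""
    return f"class {name}Service:\n    pass\n"


def generate(names, ignore=(), overrides=None):
    """Render every component except ignored ones, honoring consumer overrides."""
    overrides = overrides or {}
    outputs = {}
    for name in names:
        if name in ignore:
            # The consumer maintains this component by hand; never overwrite it.
            continue
        render = overrides.get(name, default_render)
        outputs[name] = render(name)
    return outputs
```

With `ignore={"Report"}` the generator skips `Report` entirely, and an entry in `overrides` lets a consumer replace the rendering of one odd component without touching the generator.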
Not outputting metadata
Useful metadata about the generated code, its structure, and other constraints should be output in a form that can be used by other code generators further up the chain. While your language of choice may support a degree of meta-programming, other languages may not support it as well. This metadata can also be used to generate unit tests to make sure that the code aligns with the output metadata.
This metadata should probably be in a portable format of some sort. XML or (more popularly) JSON are a good choice for this. You want the format to be human readable so that anyone generating code from the metadata can do so easily.
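As a sketch of what that output could look like, here is a Python example that writes a JSON description alongside the generated code. The document shape (`version`, `classes`, `fields`, `nullable`) is an assumption made for illustration, not a standard.

```python
import json


def describe_class(class_name, fields):
    """Build a plain dictionary describing one generated class."""
    return {
        "class": class_name,
        "fields": [
            {"name": name, "type": type_name, "nullable": nullable}
            for name, type_name, nullable in fields
        ],
    }


def write_metadata(descriptions):
    """Serialize to indented, key-sorted JSON so it stays readable and diffs cleanly."""
    return json.dumps({"version": 1, "classes": descriptions}, indent=2, sort_keys=True)


metadata_text = write_metadata(
    [describe_class("Customer", [("id", "int", False), ("email", "str", True)])]
)
```

Any downstream generator, in any language with a JSON parser, can read `metadata_text` and build its own layer or a set of unit tests from it.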
Directly hitting assets such as databases
During code generation, it can be tempting to do things like directly query databases for data to use in code generation. While this is simple, it also makes it a nightmare to write tests for your code generator. In addition, this approach tends to slow down the generation process considerably. It also slows down development of your code generation templates.
Directly hitting a database instead of a file full of database metadata also causes a few other issues. If all your developers share a database, it can mean that one team member editing the database can break code for everyone else. It means that development stops when the database server is down, rather than being able to continue along. If you are source controlling a file full of database metadata, it may be useful in reconstructing the database at a later date as it was at the time of a checkin.
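One way to split this up, sketched in Python with SQLite purely as a stand-in database: a snapshot step that touches the database occasionally, and a generation step that only ever reads the snapshot text you checked in. Both function names are hypothetical, and the sketch assumes simple table names.

```python
import json
import sqlite3


def snapshot_schema(connection):
    """Run occasionally: the only step that talks to the database."""
    tables = {}
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in names:
        # Assumes simple table names; PRAGMA arguments cannot be bound as parameters.
        columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        tables[table] = [{"name": column[1], "type": column[2]} for column in columns]
    return json.dumps(tables, indent=2, sort_keys=True)


def generate_from_snapshot(snapshot_text):
    """Run on every build: needs only the snapshot, never a live server."""
    tables = json.loads(snapshot_text)
    # Stand-in for real generation: the column names a template would consume.
    return {name: [column["name"] for column in columns] for name, columns in tables.items()}
```

Tests for the generator can then feed `generate_from_snapshot` a short JSON string and never open a connection.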
Lack of ability to unit test
Both your generated code and your code generator need to be built in a way that allows you to unit test them. Instead of writing a single item by hand and then testing it, a code generator may generate dozens or even thousands of similar items. You’ll want to run tests across them all to make sure that you catch edge cases early. The code generator itself should also be as testable as you can make it. When you have a large code generation process, the ability to quickly unit test the thing is very helpful for sanity checking.
Your generated code should also follow reasonable patterns of software architecture. Dependency injection is extremely important in generated code, as it is one of the best ways to introduce isolation testing across a huge number of generated components. If you are generating code for systems that are object-oriented, you should generally follow other best practices like the single responsibility principle, open/closed principle, and proper encapsulation. You should probably also generate at least some of your unit tests and carefully watch to make sure you don’t develop huge gaps in test coverage.
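Here is a small Python sketch of running the same check across every generated component, with a fake dependency injected into each one. The template, `FakeStore`, and `check_all` are invented for the example; the generated class takes its store through the constructor, which is what makes the isolation possible.

```python
from string import Template

# Hypothetical template: the generated class receives its dependency by injection.
SERVICE_TEMPLATE = Template(
    "class ${name}Service:\n"
    "    def __init__(self, store):\n"
    "        self.store = store\n"
    "\n"
    "    def get(self, key):\n"
    "        return self.store.load('${name}', key)\n"
)


class FakeStore:
    """Test double that records calls instead of touching real storage."""

    def __init__(self):
        self.calls = []

    def load(self, kind, key):
        self.calls.append((kind, key))
        return {"kind": kind, "key": key}


def check_all(names):
    """Generate each service, run the same check on it, and return failing names."""
    failures = []
    for name in names:
        namespace = {}
        exec(SERVICE_TEMPLATE.substitute(name=name), namespace)
        store = FakeStore()
        result = namespace[f"{name}Service"](store).get(7)
        if store.calls != [(name, 7)] or result["kind"] != name:
            failures.append(name)
    return failures
```

The list passed to `check_all` can be the full set of entities the generator knows about, so a template bug shows up for every component at once rather than in whichever one someone happened to test.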
Hand-modifying generated code after generation
If you generate code, you should (uhh…generally) not modify it by hand after the fact. This introduces several problems. The first is that it is really hard to fix all the places that you need to fix by hand, without missing any. This code will also be overwritten when (not if) someone runs the generator again.
Needing to touch the final output also points towards deficiencies in your design that should be addressed. If you need to modify the code, that means you probably really need to modify the template. This might also indicate that you need to have hooks and other ways to override the code. It can also indicate that you don’t have sufficient metadata when generating your code.
Lack of performance counters
You should be able to tell how often a piece of generated code is hit, how long it takes on average to run, and the like. This helps you watch for major performance issues. A common thing that happens with generated code (or any abstraction, really) is that less skilled developers will spend five seconds coming up with a mental model for it and then use that forever, even if it was wrong. Basically, if your code works, people will use it without understanding it. And if you generated a bunch of code, they’ll use it in a bunch of places. You want to track resource allocations (database connections, memory allocation, handles, threads, etc.), throughput, and time required for operations.
This doesn’t necessarily mean directly putting timing and other instrumentation into your code. Rather, you need to have a strategy for troubleshooting performance. This could be as simple as conditional compilation statements and as complex as adding in a library for instrumentation. It could also mean having a third party tool on hand that can do this work for you.
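If you do decide to build a little instrumentation in, a decorator is one low-ceremony option. This Python sketch is only one possible strategy; `COUNTERS` and `counted` are hypothetical names, and a real project might prefer an existing instrumentation library or an external profiler.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process counter store, keyed by function name.
COUNTERS = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def counted(func):
    """Record how often a function is called and the total time spent in it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised, so failures are counted too.
            stats = COUNTERS[func.__qualname__]
            stats["calls"] += 1
            stats["seconds"] += time.perf_counter() - start

    return wrapper
```

The template can emit `@counted` above each generated method only when a build flag asks for it, which keeps the instrumentation out of builds that don't need it. Average time per call is `seconds / calls`.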
String concatenation instead of templates
String concatenation is an easy way to dynamically write code. It also quickly leads to a mess. If you use string concatenation, you quickly find yourself dealing with nasty problems, like escape sequences in huge blocks of code, attempting to do multiple levels of string replacement, etc. It’s also really hard to read and modify code that writes other code in this manner. You’ll also find that you spend an inordinate amount of time trying to format this code, both for use in the generator and for making it understandable once it is generated.
If you use a template instead, it makes it far easier to reason about the structure of the generated code. Remember that you are experienced in reading code in your language. You might as well keep the code generator to a similar structure. Using templates can also help you convert a working code sample into a template more quickly. Rather than trying to rework the thing to use string concatenation, you can just copy it into a template and modify the things that actually change.
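For a feel of the difference, here is a short Python sketch using the standard library’s `string.Template`. The template and the `render_model` helper are made up for the example. Notice that the template reads like the class it produces, with placeholders only where things actually change.

```python
from string import Template

# The template is laid out like the code it generates.
MODEL_TEMPLATE = Template('''\
class ${class_name}:
    """Generated from the '${table}' table. Do not edit by hand."""

    TABLE = "${table}"

    def __init__(self, ${arguments}):
${assignments}
''')


def render_model(class_name, table, fields):
    """Fill in the parts that vary; the structure stays in the template."""
    return MODEL_TEMPLATE.substitute(
        class_name=class_name,
        table=table,
        arguments=", ".join(fields),
        assignments="\n".join(f"        self.{field} = {field}" for field in fields),
    )
```

Starting from a working hand-written class, you could paste it into the template and swap the varying names for placeholders.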
Single file output
A lot of code generators (for instance, Visual Studio’s T4 engine) will dump out a single file. This makes troubleshooting really nasty, as the files can get huge. This is also a problem when you or a coworker regenerates code, if the generated code ends up in source control, as merge conflicts on 70,000-line files are a truly awful experience.
Instead, your output should be broken up in the same way as your non-generated code. This can be tricky if your code generator has a bias towards single files. You also need to make sure that your files are named in a way where there isn’t a naming collision.
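Here is a minimal Python sketch of writing one file per generated item while refusing names that would collide. The `write_outputs` function is hypothetical, and the case-insensitive comparison is an assumption on my part, since some file systems treat `Order.py` and `order.py` as the same file.

```python
from pathlib import Path


def write_outputs(root, sources):
    """Write one file per generated item; refuse names that would collide."""
    seen = {}
    for name in sources:
        key = name.lower()  # case-insensitive file systems would merge these
        if key in seen:
            raise ValueError(f"Output name collision: {seen[key]!r} and {name!r}")
        seen[key] = name

    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in sources.items():
        path = root / name
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Checking every name before writing anything means a collision fails the run up front instead of silently overwriting a file halfway through.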
Book Club
Algorithms to Live By: The Computer Science of Human Decisions
Brian Christian and Tom Griffiths
The book closes out with a couple of chapters about interaction with others. Chapter 10 talks about networking and how we connect. It starts by explaining the history of packet switching on phone lines and gets into topics like message acknowledgement and flow control in linguistics. Chapter 11 looks at game theory. In this chapter the authors look at man vs. man and man vs. society. They delve into cooperation and competition. In the conclusion they leave the reader with three pieces of wisdom. First, some algorithmic approaches can be directly translated into our lives. Second, using an optimal algorithm should come as a relief even if you don’t like the results. Third, you can draw a line between problems with a straightforward solution and those without one.
Tricks of the Trade
Something going wrong once or twice doesn’t mean an approach is bad. Lots of devs hate code generation because they’ve been exposed to situations where it was bad. That attitude is the same attitude that makes really crusty old coders think that modern practices are bad. Don’t fall into that trap.