Sample Data Generation

When you start a new software project, you probably don’t spend a lot of time thinking about how to set up useful and realistic test data. In general, by the time you start thinking about it, your application has reached a level of complexity that makes doing so difficult. Additionally, you will be tempted to try to create your sample data set in a hacky, short-term manner. While such solutions can work for a little while, their shortcomings eventually become obvious. In particular, you’ll start to notice bugs occurring in production that should have been caught in development, and would have been if only you’d had good data.

As your application ages, there are numerous reasons to start paying better attention to the way that you manage sample data for use by developers. Not only does this enable more agility within your development workflow, but it can make troubleshooting easier. Further, having a clean way to generate sample data on developer machines can make it easier to onboard new team members. If you make the generation of sample data into a repeatable process, it also makes it far easier to do repeatable system testing from a known good state.

Done properly, the generation of sample data becomes a first-class part of your application development workflow. As code evolves, your sample data should also evolve in a way that makes sure it remains useful for testing how the system will behave in production. While this seems like more of a QA role, making it a development responsibility confers several distinct advantages. In particular, it tightens the feedback loop between development and QA. It also makes it quicker to set up a new environment, whether for a new developer or for another part of the development process, such as quality assurance.

Having good sample test data is critical for being able to write code effectively in your local development environment. Ideally, this test data will reflect the sorts of scenarios that you are likely to encounter in production. Not only does it make troubleshooting easier, but it also increases the likelihood that you will spot potential problems earlier in development. Finally, it makes it much easier to onboard new developers and reduces the friction developers experience when beginning to work on a different part of the system.

Episode Breakdown

Why you need to have good sample data in your local working environment

Sparsely populated local development databases are fast, and this speed can mask performance and data integrity issues. If something can happen in production and cannot happen in your local (or worse, QA) environment, then it will consistently surprise you in production. Having appropriate test data in your local environment ensures that you can do reasonable sanity checks without getting QA involved. This tightens your feedback loop and makes you more productive.

This test data also makes it easier for you to quickly troubleshoot new scenarios that come up. Let’s say that you find there is an issue if a certain column value is null in a specific use case – it’s much less effort to null out that column as needed, instead of having to fully (and manually) create all the required test data from scratch. Additionally, as you make schema and system changes, having appropriate sample data will force you to write appropriate data migration code, rather than putting it off until it breaks in the QA or production environment.

Why shared dev databases are evil.

It can be tempting to use a single database for all developers to make it easier to have appropriate test data. However, there are far more downsides than upsides with this approach.

First, this approach means that the actions of a single developer can break things for the entire team. Because such databases are everyone’s responsibility, the data in them ends up being no one’s responsibility. This tends to result in very poor-quality data that causes problems in the app, including problems that might not even be possible in production.

Shared databases are also problematic when you are attempting to load test, stress test, or do large changes to the system because they require coordination across the entire team. At best, this will slow you down.

Why can’t I just copy production?

Back in the day, it wasn’t uncommon for developers to simply work from a copy of the production database. However, considering the modern regulatory environment, this is probably a really bad idea. While your production database is certainly a realistic depiction of what real-world data currently looks like, it adds complications when you are trying to make large changes and need to refresh your sample data.

You could potentially use a copy of production and sanitize the data to remove sensitive information, but there is a risk that you’ll forget to sanitize it, or that you’ll miss something. The other problem here is that your production database may be too large to work with effectively, both in terms of disk space and in terms of the time required to rebuild your sample data.

Why developers need to be in charge of local sample data (instead of QA)

Since QA has to generate sample data for testing, it might be tempting to reuse it. You might get away with that for a while, but eventually it will become a problem: the data set as QA understands it is different from what you need as a developer. For starters, if you are building anything new, you are going to need realistic sample data for it before QA touches anything.

Additionally, many QA departments largely deal with integration and load testing. That is, they envision your data as it exists at the EDGE of your system, not in the middle of your system. This can mean that they don’t have test data for everything that you need. Finally, leaving QA in charge of sample data creation for developers can add an additional burden to your QA team, which may well already be overloaded. Since good QA personnel are less common than good developers in many locations, this is probably not a desirable approach.

What is meant by “realistic” test data.

Frequently, developers will generate test data in a loop. While this approach is quick and dirty, the data it produces tends to be fairly uniform. This uniformity means that the sample data may not accurately reflect what you’ll encounter in the real world. When generating sample data, you are going to want a broad range of values (for numeric fields), a broad range of lengths and content (for strings), as well as realistic numbers of items in lists (for instance, you want test orders to contain a varying number of items).
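As a rough illustration, here’s a minimal Python sketch of that kind of variation; the field names and value ranges are invented for the example:

```python
import random
import string

def random_string(min_len=1, max_len=200):
    # Vary both length and content so string handling gets exercised.
    length = random.randint(min_len, max_len)
    return "".join(random.choices(string.ascii_letters + " '-", k=length))

def make_order():
    return {
        # Broad numeric range: zero, tiny, typical, and very large amounts.
        "total_cents": random.choice([0, 1, 999, 104_250, 9_999_999]),
        "customer_note": random_string(),
        # A varying number of line items, not always the same count.
        "items": [random_string(5, 40) for _ in range(random.randint(1, 25))],
    }

orders = [make_order() for _ in range(100)]
```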

You also want to be careful about dates in your test data. It doesn’t take long for realistic test data to become unrealistic if dates are in the mix. This means that you are going to want dates to be set relative to the current date, instead of being static. Additionally, you will need to make sure that you have a realistic distribution of data, especially on a per-user basis. In many developer data sets, there will be one or two users with thousands of transactions, while others have none. This makes reports look strange and unrealistic.
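Here’s a sketch of both ideas, with placeholder user IDs and field names: dates generated relative to the current date, and a skewed-but-plausible spread of transactions per user (including some zero-state users):

```python
import random
from datetime import datetime, timedelta

def recent_date(max_days_back=365):
    # Relative to "now", so the data never quietly goes stale.
    return datetime.now() - timedelta(days=random.randint(0, max_days_back))

def transactions_for_users(user_ids):
    data = {}
    for user_id in user_ids:
        # Skewed but plausible: many light users, a few heavy ones, and
        # some zero-state users, rather than one user owning everything.
        count = min(int(random.paretovariate(1.5)) - 1, 500)
        data[user_id] = [
            {"user_id": user_id, "created_at": recent_date()}
            for _ in range(count)
        ]
    return data

sample = transactions_for_users(range(50))
```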

Why you need to be able to quickly nuke and pave your data.

You are going to want developers to be able to quickly remove and rebuild sample data. On large teams, this means that developers will quickly have sample data for parts of the system that they didn’t build. Being able to delete data without consequences will also make things like load testing easier, as you don’t have to keep massive data sets around forever.

This ability also makes developer onboarding easier, because sample data setup is streamlined, meaning that setting up a new developer machine (or moving a developer to a newer machine) is faster. This could even make it possible to wipe and restore data at the beginning of every new story, so that existing bad data won’t be a problem.
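As a sketch of what a nuke-and-pave entry point might look like, here’s a self-contained Python example using SQLite; in a real project, the migration and seeding steps would invoke your actual tooling, and the schema here is a placeholder:

```python
import sqlite3

def run_migrations(conn):
    # Placeholder: in practice, invoke your real migration tool here.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def seed_sample_data(conn):
    # Placeholder: in practice, invoke your real sample data generators.
    conn.execute("INSERT INTO customers (name) VALUES ('NoOrders Jones')")

def nuke_and_pave(db_path="dev.db"):
    conn = sqlite3.connect(db_path)
    try:
        # Drop every user table: sample data must never be precious.
        tables = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
        for (name,) in tables:
            conn.execute(f'DROP TABLE IF EXISTS "{name}"')
        run_migrations(conn)    # rebuild the schema from scratch
        seed_sample_data(conn)  # restore a known-good starting state
        conn.commit()
    finally:
        conn.close()

nuke_and_pave()
```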

Why sample data shouldn’t be created at the end of database migrations.

It’s common to only create sample data in a new system after all database migrations have been applied. While this does keep all the sample data generation in one place, it means that any artifacts of migration are no longer represented. Since your production data was not all entered after migration, your sample data no longer reflects what could happen in production.

Additionally, if you can generate your sample data (in dev only) as part of your migration process, then you can also quickly roll back a migration when you are testing. This simplifies testing migration code (as long as the Down() part of your migration scripts is sound). If your sample data is instead all applied at the end and is extensive, small changes in your system will require much larger changes in your sample data generation code. This can also be annoying when consolidating database migrations, so you’ll need to code around it.
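To make this concrete, here’s a hedged sketch of dev-only seeding inside a migration, written with Alembic (a Python migration tool, where downgrade() plays the role of Down()); the APP_ENV check and the table layout are assumptions for illustration, not a prescribed convention:

```python
import os

import sqlalchemy as sa
from alembic import op

def upgrade():
    accounts = op.create_table(
        "accounts",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("status", sa.String(20), nullable=False),
    )
    # Seed sample rows only in development, inside the same migration, so
    # the data passes through later migrations just like production data.
    if os.environ.get("APP_ENV") == "development":  # assumed convention
        op.bulk_insert(accounts, [
            {"id": 1, "status": "active"},
            {"id": 2, "status": "pending_setup"},
        ])

def downgrade():
    # Dropping the table removes the seeded rows too, keeping rollbacks clean.
    op.drop_table("accounts")
```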

Branching, merging, and source control considerations.

Your code for generating sample data should live alongside the main application code in source control. It does not necessarily have to be deployed with the rest of the application (and often shouldn’t be), but it gets nasty if this code is out of sync with everything else. Your sample data generation code should be broken up instead of being in a single massive file. Otherwise, merge conflicts will be extremely annoying as your application gets older and larger.

When you are working on a branch and need to change your sample data in some way (whether by adding more data, altering existing data with a migration, or the like), the changes to the sample data should be made on the same branch. This way, it all stays together through the rest of the software lifecycle. You will probably want to generate your sample data using the same language that you use for your business objects, but you should not use the business objects themselves for this, since business objects change based on the version of the application, rather than the database migration level.
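For instance, a seeding step might use plain SQL pinned to the schema as it exists at that migration level, rather than going through the business objects; the table and columns below are invented:

```python
import sqlite3
from datetime import datetime, timedelta

def seed_customers(conn: sqlite3.Connection):
    # Plain SQL pinned to this migration's schema. If the Customer business
    # object later gains or renames a field, this seed step still matches
    # the schema as it existed when this migration was written.
    now = datetime.now()
    conn.executemany(
        "INSERT INTO customers (name, created_at) VALUES (?, ?)",
        [
            ("NoOrders Jones", (now - timedelta(days=30)).isoformat()),
            ("ManyOrders Smith", (now - timedelta(days=365)).isoformat()),
        ],
    )
```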

What about the unhappy path?

You should also have data in the system that doesn’t reflect everything working perfectly all the time. This covers everything from failed transactions, to users whose accounts are not completely set up, to archived and deleted records. This is something that is easy to miss when generating sample data, but it’s vitally important that you include such sample data in your system, especially if you are going to be nuking and paving data regularly.

You want to make sure that such data is present so that you can see how it looks in the UI and in reports. If you don’t have this data available, you’ll often miss small things that will be found in QA, or worse, in production. Another situation where the unhappy path matters is data migrations. Unhappy paths often involve data that is different from happy paths, which can cause migration problems and subtle bugs, so you are better off covering this situation locally as well.
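A small sketch of deliberately seeding unhappy-path records alongside happy ones; the statuses and weights are illustrative rather than drawn from any particular schema:

```python
import random
from datetime import datetime, timedelta

UNHAPPY_STATES = ["failed", "chargeback", "abandoned_signup", "archived", "deleted"]

def make_transaction():
    # Mostly happy records, but failures and partial states are always present.
    status = random.choices(
        ["settled"] + UNHAPPY_STATES,
        weights=[80, 5, 5, 4, 3, 3],
    )[0]
    return {
        "status": status,
        # Archived/deleted rows carry a timestamp, so migrations and reports
        # have to handle them rather than pretending they don't exist.
        "deleted_at": (
            datetime.now() - timedelta(days=random.randint(1, 90))
            if status in ("archived", "deleted")
            else None
        ),
    }

sample = [make_transaction() for _ in range(200)]
```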

Boundary conditions, and “types” of data your application deals with.

In addition to testing the unhappy path, you should also consider things that function as boundary conditions in your code. For instance, if your application handles monthly payments, you should cover payments on the first, the 28th, days after the 28th, February 29th, and year end. A lot of this stuff will be application-specific. For a given type in the system, you’ll want to make sure you have a good mix of possible states, especially if those states are used in conditional statements within your code.

If a property is used in arithmetic calculations, make sure that you have some records where the value is set to zero. Similarly, if you don’t have data constraints keeping a value from being null, then make sure nulls are represented in your testing data set. Also make sure to consider zero states. This would include things like customers with no orders and the like. This will help you make sure that UIs and reports look right when some data is missing.
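Boundary cases like these are worth enumerating explicitly rather than hoping random generation happens to hit them. A sketch, with illustrative dates and amounts:

```python
from datetime import date

BOUNDARY_PAYMENT_DATES = [
    date(2024, 1, 1),    # first of the month
    date(2024, 1, 28),   # the 28th
    date(2024, 1, 31),   # after the 28th
    date(2024, 2, 29),   # leap day
    date(2024, 12, 31),  # year end
]

BOUNDARY_AMOUNTS = [0, 1, -1, 10_000_000]  # zero, minimal, negative, huge

def boundary_payments():
    rows = [{"amount_cents": a, "paid_on": d}
            for a in BOUNDARY_AMOUNTS for d in BOUNDARY_PAYMENT_DATES]
    # Where constraints allow nulls, represent them too; a zero-state
    # customer (no payments at all) should be seeded alongside these.
    rows.append({"amount_cents": None, "paid_on": None})
    return rows
```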

Data construction for easy troubleshooting

When you are creating your sample data, do so with an eye towards making troubleshooting as easy as possible. While realistic-looking names sound nice, they aren’t particularly helpful when you are trying to actually use the sample data. Many database entities have name or title attributes; use these to make it clear what the test entities are for. For instance, you might create a customer named “NoOrders Jones” instead of “Bob Williams” to make it obvious that it is used for testing the zero state of orders on a customer.
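A few more invented names in the same spirit, each encoding the scenario the record exists to exercise:

```python
# Self-describing test entities: the name says why the record exists.
TEST_CUSTOMERS = [
    {"name": "NoOrders Jones"},       # zero-state: customer with no orders
    {"name": "BulkOrders Baker"},     # very large number of line items
    {"name": "FailedPayment Patel"},  # unhappy path: declined card
    {"name": "HalfSetup Harris"},     # onboarding never completed
]
```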

This becomes especially important with more complex testing scenarios, especially those dealing with configuration that isn’t immediately obvious. This is also one more reason to frequently nuke and pave your sample data; otherwise, over time, small changes will eventually mean that records no longer match their labels.

Tricks of the Trade

A lot of times in life we have to make decisions based on limited data or on a sampling of the available information. For example, what car to buy? You aren’t going to go out and test drive every single car, or even every type of car, so you make decisions based on groupings and samples of data. This helps you narrow down your choices. Obviously, the larger your sample size, the better able you will be to make decisions. Unfortunately, we aren’t always able to have larger sample sizes but still have to make choices based on what we do have. When you are in that situation, understanding that you may not have all the information, and that a small sample is not always representative of the larger group, will help you.