“If carpenters built houses the way programmers write code, the first woodpecker to come along would destroy civilization”. – Gerald Weinberg
We all have to deal with errors in our code, or at least, we’re supposed to. However, you’ve probably experienced a situation where a small problem caused another small problem, which caused a big problem, which ended up taking an entire system down. If you’ve experienced this, you are probably also painfully aware that these failures can be very expensive and hard to recover from. Second order effects are usually not that hard to predict, if you are thinking about them. However, you’ll also find that neither you nor anyone else does a particularly good job of thinking about the second, third, and fourth order effects of decisions.
Massive failures, in particular, are easily predicted. For instance, the Japanese attack on Pearl Harbor was easily predictable, based on rising tensions between the governments of Japan and the US, the layout of islands in the Pacific, and advancing aviation technology. In fact, the attack was predicted in 1924 by Billy Mitchell, who analyzed the situation. However, in spite of these predictions, the US fleet was still caught unprepared. Not only did this cause a catastrophic loss of lives and equipment, but it also led to a larger war, a war that might not have happened (or might not have occurred in the same way) had the US been better prepared. At the end of the day, people in positions of power didn’t believe that such a chain of events could happen, nor were they able to shift resources quickly enough to deal with a rising threat. Such a threat couldn’t be mitigated by academic theorizing, but would require being in a position to get regular feedback and iterate quickly (this is a little harder to do with battleships, however).
Antifragile design is more complex than simply handling errors and keeping them from causing problems. Rather, it is an approach to embracing and using the natural chaos of the world to improve your response to small problems, so that a chain of them doesn’t result in a massive failure. As explained by Nassim Taleb, “antifragility is a property of systems in which they increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures”. It is fundamentally different from the concept of resiliency (the ability to recover from failure). Such systems exist in nature, in biological systems as diverse as muscles and immune systems.
There are general principles of system (and life) design that need to be applied if you don’t want to constantly be dealing with problems. The principles of antifragility are especially useful to observe when you start noticing catastrophic failures that manifest as a chain of smaller failures that increase in intensity until something breaks in a major way. Following these principles, you should be able to figure out and mitigate many of them.
The continuum of fragility
Fragile – Breaks when pressure is applied.
Robust – Handles pressure, but does not become stronger under pressure.
Antifragile – Uses pressure to drive improvements that make the system stronger over time.
Why antifragile in systems?
People expect systems to stay up, but as systems become more complex, the number of things that can fall apart increases exponentially. As a system scales in complexity, both the cost of downtime and the difficulty of bringing things back online increase, unless you specifically design the system to improve recovery options.
A partially functional system that still allows most operational goals to proceed has less “cost” in terms of time, money, and management/customer irritation than a system that completely falls over. A partial system outage doesn’t always have to be visible to everyone. If it doesn’t become their problem, you’ll find most people don’t care.
Small system failures are learning opportunities for avoiding big system failures, if you can survive them. Total avoidance of failures is impossible, so “shaping” your reaction to failures is an important goal. Catastrophic system failures are often caused by the buildup of smaller failures. The more small failures your system can successfully mitigate, the more likely it is that you will avoid at least some catastrophic failures.
Why antifragile in life?
We all have a normalcy bias, but that bias can really get us in trouble. The job, relationship, life situation, or health that you took for granted may not always be there. If you aren’t prepared, the damage can be significant. Your life has probably accumulated more complexity over time. This is true for most people. Along with the added complexity, you have probably accumulated a few additional “life failure” modes that you haven’t considered.
While some interruptions in your life are unavoidable, you will have a better quality of life if you can avoid catastrophic failures. Like computer system failures, massive collapses in life are often composed of a number of smaller failures chained together into a catastrophic collapse. For instance, your spouse leaves you, you lose your house, and then your car breaks down. You could survive any of those by themselves, but taken together they will make you homeless.
It’s important to note here that we are using the term “failure” in the sense of a systemic collapse (avoidable or not) and not in terms of a value judgement on someone experiencing these problems. Cascading failures in life are extremely destructive, usually expensive, and can be fatal if you are particularly unlucky. Worse still, they often have psychological and financial impacts for years afterwards, which can leave you vulnerable to even more catastrophic consequences.
Rules to be Anti-Fragile
Stick to simple rules
“How can you think yourself a great man, when the first accident that comes along can wipe you out completely.” — Euripides
Adding complexity for its own sake makes systems more fragile. While sometimes it’s true that systems really are complex, it’s also true that human beings are profoundly capable of making simple things more complicated than they are.
How to apply in systems: Don’t use complex software architectures until you need them and can prove that you need them based upon real world experience.
How to apply in life: Don’t add complexity to your finances and lifestyle until you can comfortably afford more than it costs. Limit yourself to smaller, more achievable goals rather than grand visions.
If a piece of complexity isn’t serving a purpose, get rid of it. If you can’t get rid of it, at least practice being without it, so that failure modes don’t surprise you. This, by the way, is why stoic philosophers suggested “practicing poverty”. Periodically put yourself in a position you fear for a while to see how you handle it.
Decentralize
Systems that are controlled in one place are easy to disrupt in one place as well. Top down control topologies can only work well in very small systems. Note that centralization is how most systems (natural or otherwise) will trend over time, right up until centralization creates a vulnerability that often breaks the system entirely.
How to apply in systems: Have failovers for critical systems that have to be centralized (such as authentication). Try to have setups that are interruption resistant (queuing, etc.) for parts that don’t have to be centralized.
How to apply in life: Don’t overly rely on one person or situation for your quality of life, whether that’s a job, a relationship, etc. This doesn’t mean “having a backup spouse”, but it does mean that you need to have a support network and friends in addition to a spouse. Have ways of making money if your job disappears.
When adding to systems, prefer adding things that aren’t centralized or have redundancy built into them. This is one of the reasons many businesses are moving their infrastructure to the cloud – doing so makes it easier to recover from failure because redundancy is built in.
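The systems advice for this rule (failovers for what must be centralized, queuing for what doesn’t have to be) can be sketched roughly as follows. This is only an illustration under stated assumptions: call_with_failover, RetryQueue, and the use of ConnectionError to signal an outage are hypothetical names chosen for the example, not part of any particular library or product.

```python
from collections import deque
from typing import Callable, Sequence


def call_with_failover(providers: Sequence[Callable[[str], bool]], user: str) -> bool:
    """Ask each provider in order and return the first answer we get."""
    last_error = None
    for provider in providers:
        try:
            return provider(user)
        except ConnectionError as exc:
            last_error = exc  # this provider is down, so try the next one
    raise RuntimeError("all providers failed") from last_error


class RetryQueue:
    """Holds work that could not be delivered so it can be retried later."""

    def __init__(self, send: Callable[[str], None]):
        self._send = send  # hypothetical delivery function
        self._pending = deque()

    def submit(self, item: str) -> None:
        self._pending.append(item)
        self.flush()

    def flush(self) -> None:
        """Deliver items in order; stop at the first outage and keep the rest."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # downstream is unavailable, keep items for later
            self._pending.popleft()

    def __len__(self) -> int:
        return len(self._pending)
```

The failover helper covers the part that has to stay centralized, while the queue lets the rest of the system keep accepting work during an interruption and catch up once the downstream service returns.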
Develop layered systems
This is an extension of the previous rule and can help mitigate problems where logic is required to be centralized. For instance, if you have a multi-tenant piece of software, the list of tenants is crucial to the system staying online. However, that constitutes a central point of failure. If, however, the list of tenants was loaded into a distributed cache, that layer would effectively protect the rest of the application from a transient fault in this component.
How to apply in systems: Use layering to keep problems in one area of the system from leaking into other areas. This is the core reason that things like objects and three tier architectures are popular in development instead of global variables and business logic in the database.
How to apply in life: Protect yourself from a job loss by living below your means, having an emergency fund, etc. Create latency between the time when you lose your job and the point when you run out of money to live on. Have good boundaries so that other people’s drama doesn’t harm your quality of life.
Always add protective layers between systems that are critical to you and systems that are volatile or not under your control. Prioritize keeping problems from spreading.
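To make the tenant list example from this rule concrete, here is a minimal sketch of a protective layer that keeps serving the last known good copy when the central source has a transient fault. It is an assumption-laden illustration: CachedTenantList and its load callable are hypothetical, an in-process copy stands in for a real distributed cache, and ConnectionError stands in for whatever error the real tenant store raises.

```python
import time
from typing import Callable, List, Optional


class CachedTenantList:
    """Layer between the application and the central tenant store."""

    def __init__(
        self,
        load: Callable[[], List[str]],       # hypothetical call to the central store
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._tenants: Optional[List[str]] = None
        self._loaded_at = 0.0

    def get(self) -> List[str]:
        fresh = (
            self._tenants is not None
            and self._clock() - self._loaded_at < self._ttl
        )
        if fresh:
            return self._tenants
        try:
            self._tenants = list(self._load())
            self._loaded_at = self._clock()
        except ConnectionError:
            if self._tenants is None:
                raise  # nothing cached yet, so the fault cannot be hidden
            # Transient fault: fall back to the last known good copy.
        return self._tenants
```

The rest of the application only talks to get(), so a brief outage in the tenant store shows up as slightly stale data instead of a failure everywhere at once.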
Build in redundancy and overcompensation
If part of a system is potentially volatile (and they all are), have the ability to swap it out with something else. Systems that run close to resource limits have an increased chance of a catastrophic failure.
How to apply in systems: Don’t add system capacity that is only sufficient to handle your current maximum load, unless you can very quickly scale up. Don’t have only a single instance of something that provides a critical service. Make sure that you have failovers that are each capable of handling most of the load of a system so they don’t get overwhelmed.
How to apply in life: Don’t live hand to mouth. Have options other than your current day job and be able to pivot quickly if things go badly.
Think carefully about how things scale and what failure modes look like at scale. If possible, simulate failures to make sure they can be withstood.
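One cheap way to simulate a failure is to do the arithmetic before production does it for you. The sketch below assumes you can describe each instance by a single capacity number (requests per second, for example) and asks whether the survivors could carry peak load, plus some headroom, after the largest instances are lost. The function names and the 20% default headroom are assumptions made for illustration.

```python
from typing import Sequence


def surviving_capacity(capacities: Sequence[float], failures: int = 1) -> float:
    """Capacity left after the `failures` largest instances are lost (worst case)."""
    if failures < 0:
        raise ValueError("failures must not be negative")
    keep = max(len(capacities) - failures, 0)
    # Sorting ascending and keeping the first `keep` drops the largest instances.
    return sum(sorted(capacities)[:keep])


def can_withstand(
    capacities: Sequence[float],
    peak_load: float,
    failures: int = 1,
    headroom: float = 0.2,
) -> bool:
    """True if the survivors can carry peak load plus a safety margin."""
    return surviving_capacity(capacities, failures) >= peak_load * (1 + headroom)
```

For example, three instances of 100 units each can lose one and still cover a peak of 150 with headroom, while two such instances cannot. That is the difference between capacity that merely covers the current maximum load and capacity that survives a failure.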
Do not suppress randomness
Life is random. While it is necessary for industrial production, the reduction of randomness and variance in complex systems leads you to assume that your system isn’t subject to randomness. You are often far better off by designing to automatically adjust for variance, rather than assuming that there won’t be any. If randomness and disorder are accounted for when things are not catastrophic, it reduces the odds of a cascading failure by limiting the range of influence of that variance.
How to apply in systems: Don’t assume that “this problem can’t happen at this point in the code”. Enforce invariants everywhere they apply, rather than only in the places where you think they might be violated.
How to apply in life: Don’t leave home for appointments at the last possible moment. Leave early with the assumption that traffic might back up, rather than having a “failure” when something happens. Or convince others to lower their expectations (if you are Comcast).
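On the systems side of this rule, enforcing an invariant everywhere rather than only where you expect trouble might look like the sketch below. Inventory and its rule that stock on hand never goes negative are hypothetical examples. The point is that the check runs on construction and after every change, instead of trusting that callers validated their input.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    on_hand: int = 0

    def __post_init__(self) -> None:
        self._check()  # enforce the invariant even on construction

    def _check(self) -> None:
        if self.on_hand < 0:
            raise ValueError("on_hand must never be negative")

    def add(self, quantity: int) -> None:
        if quantity < 0:
            raise ValueError("quantity must not be negative")
        self.on_hand += quantity
        self._check()

    def remove(self, quantity: int) -> None:
        if quantity < 0:
            raise ValueError("quantity must not be negative")
        if quantity > self.on_hand:
            raise ValueError("cannot remove more than is on hand")
        self.on_hand -= quantity
        self._check()  # re-verify after every change, not only "risky" ones
```

The checks raise exceptions rather than using assert statements, because Python skips assert when run with the -O flag, which would silently turn the enforcement off.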
Skin in the game
Don’t take advice from people who have nothing to lose by being wrong. Don’t allow people to make decisions if there is no cost to them for being wrong. Risk clarifies the mind, and essentially provides a “price discovery” mechanism for their advice.
How to apply in systems: Don’t rely on third parties unless the third party has an SLA that covers the costs a failure might impose. If it doesn’t, then their foul-ups are cheaper for them than they are for you.
How to apply in life: Don’t put someone in a position to mess things up for you if there are no consequences for them doing so. Try to extricate yourself from such situations as soon as you discover them.
A corollary to this is that if you want success, then you need to have skin in the game as soon as possible. Otherwise you are shooting in the dark trying to figure out what works and what doesn’t. Having something to lose will force you to be better.
Tricks of the Trade
Not everyone on your team will see everything the same way. A BA may need to think in terms of tasks, but that doesn’t make sense when it comes to creating development user stories. How you approach the situation and your attitude can have a drastic effect on the outcome. Going in with the attitude of “This is the dumbest thing ever” will lead to a fight, but going in with an attitude of “I think there’s a misunderstanding” will lead to a conversation. Starting conversations when things don’t make sense will lead you to see the other person’s perspective and come up with a solution.