Distributed Computing Fallacies
Your employer may think that you can easily switch from building a regular old application to building a distributed system. You might think so as well, but there are a lot of nasty surprises that can sneak up on you if you are not prepared. Applications are getting more complex, and many developers try to deal with this fact by breaking the applications into smaller pieces that can be managed more easily. That works well, but there are a few things you need to keep in mind when building distributed apps. While the bad assumptions that we make when building distributed applications will not cause problems early on, they can definitely create some unpleasant surprises as your code ages. It’s better to avoid them from the start.
The most important problem is the CAP theorem, which suggests that there are three constraints on distributed systems, and that you can only (at most) meet two of those constraints. Those constraints are consistency, network partitioning (where a part of the network fails), and availability. You can have up to two out of the three. Because you can only have two of the three, the other one will be the one that causes you pain. Because networks are unreliable at best, that means that you will be choosing between consistency and availability when there is a network partition (you’re good the rest of the time).
Another item to consider is database design. ACID vs. BASE. ACID stands for Atomicity, Consistency, Isolation, Durability. BASE stands for Basically Available, Soft State, Eventual Consistency. Essentially, the databases themselves are designed around the same sort of constraints that plague distributed applications. You can think of ACID as being extremely partition intolerant, with the goal of keeping consistent and available. You can think of BASE as being partition tolerant, at the expense of perfect consistency.
The assumptions we can safely make when building a distributed application are few and far between. Unlike a lot of smaller applications, distributed applications tend not to have system boundaries in the way that many other applications have. Consequently, there are a lot of things that can cause your application to fail in a distributed system. However, if you avoid the major fallacies, it’s a lot easier to get it right.
09:55 The network is reliable.
We really know better than to believe this, but we frequently code as if the network were reliable. Network calls are often coded as blocking I/O, especially when they are a few levels down in abstraction. This means that the calling application will often wait, sometimes indefinitely for the result of a network call. If this happens on the UI thread, the app stops responding.
“You’ll make the mistake of making a network call and treating it like it’s on the same machine.”
Sometimes the problem is more complex than just “the network”. Things like DNS servers can cause problems if they start malfunctioning, resulting in sporadic failures. Sometimes the network isn’t “down”, per se, but there are structural problems causing periodic packet loss that slows transmission. If your code has timing issues, this is when you’ll find them.
14:45 There is no network latency.
Even the fastest network call is far slower than a call on the machine in most cases. This tends to be a painful lesson when you develop and test on a single machine, and then deploy to a set of them.
Even in normal operation, network calls are orders of magnitude slower. This can make things like distributed database transactions very unwieldy even in the best circumstances. You may need to restructure the way you are making calls to reduce network overhead. You may also need to employ caching and things like content delivery networks (CDNs) to make this less of a problem.
21:05 Bandwidth is infinite.
We can deal with huge amounts of data within the same machine compared to what can be dealt with over the network. For instance, the SATA 3 interface inside your computer can transmit at up to 600MB/s of actual throughput. The average speed of a residential internet connection in the US is 18.7 Mbps. 4G internet is able to handle 5-12 Mbps in general, but may have issues with connectivity. These are US numbers. Other countries (and some parts of the US) may be far worse. This means that if you develop on a single machine, lots of your assumptions will be wrong when you deal with the open internet and even more will be wrong when you deal with mobile internet, especially internationally.
“The text message is the printer ink equivalent as far as price point for moving a byte of data.”
Bandwidth also costs money. Sometimes the cost is extremely high for things like satellite internet. It’s also possible to run into situations where your bandwidth gets restricted if you use too much. Windows updates have had significant problems in both of these areas.
27:00 The network is secure.
Back in the day, people tended to make the assumption that anything inside the firewall was secure and anything outside was not. This started changing as people brought their own devices into the mix, and as laptops became more common. Essentially, perimeter security on a network is not enough any more. You have to assume that hostile parties are inside the network.
“All it means is you’re not getting someone NOT who you are talking to in there.”
This means a lot of things have to change. You can’t assume that someone calling your application is friendly and you have to consider insider attacks. You can’t assume the transport is secure either, unless you are encrypting it. In many cases, you can’t assume that files or data coming from elsewhere in the network aren’t malicious. This includes data that other applications put in a database that you read and execute on.
32:35 The network’s topology doesn’t change.
A long time ago, it was reasonable to assume that the structure of the network was a constant. It used to take a lot of work to reconfigure things. That’s not true now, and especially with data center networking, it’s fairly common to quickly change the network structure to respond to changes. Machines are cattle now, not pets. This means that it is really unwise to hard-code things like IP addresses or even machine names.
Network topology changes can have an impact on both bandwidth and latency, so your apps will need to take this into account. Latency could change if (for instance) the main connection fails and a secondary connection is brought online to keep the system alive. Bandwidth could change if the admins bring in new hardware, or change configuration. Your application needs to gracefully handle topological changes so that it doesn’t start causing problems. This gets especially fun when something downstream of you can only handle so much traffic, but your app is only constrained by bandwidth. When the bandwidth restriction goes away, it’s entirely possible for you to nuke the other service.
37:50 There is a single administrator.
In smaller environments, there was often only a single administrator, or maybe two (junior and senior) who controlled the entire system. This has gone away as systems have gotten larger, more critical, and as the workforce has become more mobile and distributed. Your typical modern app may have several administrators, controlling different parts of the system. Like any complex set of people, this may mean that they don’t like each other, don’t understand each other, or may not even speak the same language.
“Any other developer can step in there because … they’re still using the same standards.”
This causes you to have to think about several things. First, you can’t fail processing due to security errors. You have to gracefully handle those, because security configuration is likely to be changed by people who don’t understand your system. This also means that you have to document things more thoroughly, as you can’t rely on word of mouth. This typically also means that you and the team(s) in your organization are going to have to standardize things so that administrators can manage systems they don’t know well. You also are going to need more thorough diagnostic capabilities, because people will be troubleshooting your stuff while you are asleep, and might be doing it a hemisphere away from you.
41:45 Transport cost is zero or negligible.
While we covered this a little in terms of how network bandwidth is not infinite, it’s also not free. Even if the network is entirely owned by you, every packet transmitted has a cost, in terms of time and maintenance to transmit it. If you are regularly and unnecessarily saturating the network, the administrators will often want to get better, more expensive hardware to deal with the load. If you are on a cloud provider, you are often charged directly for network use.
This means that you can’t ignore bandwidth costs in the design of your application. It’s entirely possible for an application to work perfectly and still generate bandwidth costs that make it too expensive to be worth using. This isn’t just restricted to your internal bandwidth – you need to be aware of what you are doing to client bandwidth.
45:35 The network is homogeneous.
You also can’t safely assume that cost, latency, bandwidth restrictions, and likelihood of network failure are the same throughout the network. Your code may well be running in different data centers. The speed between nodes in the data center may be fast, but the speed between may not. You may have clients that are occasionally or even regularly disconnected, like mobile users. There may be some slow links in the architecture for various reasons.
“Sometimes data centers have problems with their connections, all it takes is a back hoe”
There are a few things this implies. You have to code defensively around network errors, including complete failures and security issues. You cannot assume that any two network calls sent in order will complete in order. You have to consider carefully how your application’s network use may impact your clients. You don’t want to saturate a critical network with requests for something that isn’t critical.
How IoT is Making Distributed Computing Cool Again
Distributed Computing has historically been relegated to academia, military and large tech enterprises. Those distributed computing concepts will be critical to the success of internet of things initiatives. For example, Ford Motor Company’s invested $182.2 million into Pivotal, a cloud-based software and services company. This article talks about how the internet of things will bring distributed computing and how it handles data to the forefront of development.
Tricks of the Trade
Be careful of the implicit assumptions you make. This can be anything from assuming that a member of the other political party is necessarily hostile, to normalcy bias, wherein you think things will remain stable because they have been while you are looking. Pattern matching is generally helpful, except when a pattern is disrupted and you aren’t prepared for it.