A friend referred me to this excellent article about preparing for system disaster: Resilience Engineering: Learning to Embrace Failure. This is related to a theme I have been calling “robustness”.
This article proposes that to ensure your systems work well, you need to hold a “game day”: actually run a test scenario in which a major part of the system suddenly goes offline. It is fine to plan around certain things becoming unavailable, but large online systems are very complex, and the interconnections are not always obvious. Regardless of the amount of planning and design, there are going to be dependencies that you were not aware of. So test for them by setting up a scenario and seeing whether the system survives. What you learn will make the system stronger.
This is really rather obvious; it is just what we call destructive testing and not really new at all.
However, it flies in the face of an engineering discipline that is oriented around “prevention of problems”. If you can prevent a given server from ever going down, then you don’t have to respond to its failure. There is far too much dependency upon such prevention planning. Individual systems are quite reliable, but as the number of interconnections grows, so does the potential for combinations of failures. Relying on prevention alone is a losing proposition, and eventually the result is a catastrophic failure.
The concept of “robustness” is that no server should fail simply because another server goes offline. All servers should be designed so that anything they depend upon can disappear at any time. Of course, they do depend upon these things, so they might be unable to continue their normal function. But they should not crash. They should not go into a state that they cannot recover from, and they should not exit, requiring a human operator to restart them. Instead, they go into a protected state, much like a turtle going into its shell. In this mode, they attempt to reconnect to all the servers they depend upon. When they succeed, they reinitialize themselves and get back to fully operational mode.
When you think about it, doing anything else is a disaster. Imagine you had 100 servers all interconnected such that each server depends upon 6 others. Without robustness, one server going down would cause 6 more to go down. Those would cause 30 more to go down, and before you know it, all the servers are down or in a broken state. No rational architect would ever design such a system, and no successful architect ever did.
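To make the cascade concrete, here is a rough simulation sketch. The topology is my own assumption, not something from the article: server `i` depends on the next 6 servers around a ring, so the size of each wave will not match the 6-then-30 figures above, but the end result is the same. The names `ring_dependencies` and `cascade` are made up for illustration.

```python
from collections import deque
from typing import Dict, Set


def ring_dependencies(n: int = 100, fan_out: int = 6) -> Dict[int, Set[int]]:
    """Assumed layout: server i depends on the next `fan_out` servers on a ring."""
    return {i: {(i + k) % n for k in range(1, fan_out + 1)} for i in range(n)}


def cascade(deps: Dict[int, Set[int]], first_failure: int, robust: bool) -> Set[int]:
    """Return the set of servers that end up down after one initial failure."""
    down = {first_failure}
    if robust:
        # Robust servers go into protected mode instead of crashing,
        # so the failure does not spread.
        return down

    # Invert the map: for each server, which servers depend on it?
    dependents: Dict[int, Set[int]] = {}
    for server, needed in deps.items():
        for dependency in needed:
            dependents.setdefault(dependency, set()).add(server)

    # Breadth-first spread: a fragile server crashes as soon as
    # any one of its dependencies is down.
    queue = deque([first_failure])
    while queue:
        failed = queue.popleft()
        for server in dependents.get(failed, ()):
            if server not in down:
                down.add(server)
                queue.append(server)
    return down


if __name__ == "__main__":
    deps = ring_dependencies()
    print("fragile:", len(cascade(deps, 0, robust=False)), "servers down")
    print("robust: ", len(cascade(deps, 0, robust=True)), "server down")
```

With fragile servers the single failure eventually takes out every server in this layout; with robust servers only the original one is down, and the rest are merely waiting for it to come back.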
Instead, robustness is a really simple idea: each and every server is designed with an understanding of what it depends upon. At any point in time, one of those dependencies might go away. As soon as the server detects this, it goes into protected mode. It keeps checking the availability of the servers it depends upon, and when they are available again, it reinitializes.
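A minimal sketch of that loop might look like the following. Everything here is hypothetical: the `RobustServer` class, the idea of passing in one health-probe function per dependency, and the placeholder responses are assumptions for illustration, not how any particular server is built.

```python
from enum import Enum
from typing import Callable, Dict


class Mode(Enum):
    OPERATIONAL = "operational"
    PROTECTED = "protected"


class RobustServer:
    """Hypothetical server that retreats into protected mode instead of crashing."""

    def __init__(self, dependencies: Dict[str, Callable[[], bool]]):
        # Maps a dependency name to a probe that returns True when it is reachable.
        self.dependencies = dependencies
        # Start protected until every dependency has been confirmed available.
        self.mode = Mode.PROTECTED
        self.init_count = 0

    def _all_available(self) -> bool:
        for probe in self.dependencies.values():
            try:
                if not probe():
                    return False
            except Exception:
                # A probe that raises means "unavailable", never a crash.
                return False
        return True

    def _reinitialize(self) -> None:
        # Placeholder: a real server would reopen connections, reload caches, etc.
        self.init_count += 1

    def tick(self) -> Mode:
        """Run one supervision cycle; meant to be called periodically."""
        available = self._all_available()
        if self.mode is Mode.OPERATIONAL and not available:
            self.mode = Mode.PROTECTED
        elif self.mode is Mode.PROTECTED and available:
            self._reinitialize()
            self.mode = Mode.OPERATIONAL
        return self.mode

    def handle(self, request: str) -> str:
        # In protected mode, refuse politely rather than fail.
        if self.mode is Mode.PROTECTED:
            return "503 temporarily unavailable"
        return f"200 handled {request}"
```

The point of the design is that a missing dependency only ever changes the server's mode. The process itself keeps running, keeps answering (even if only to say "not right now"), and recovers on its own once the probes succeed again.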
Most reliable systems today already do this. Consider the systems that run the internet, such as the DNS servers. Imagine if they went down whenever something they depend on went away. They don’t. Mail servers running SMTP keep running no matter what happens to the other mail servers around them. So robustness is an idea that is here to stay, and GameDay is a way to make sure that you have implemented it properly.