Fragile Software

I was reminded again today of just how fragile software is. For some reason, a piece of code which had been working reliably for months suddenly stopped working because a standard library refused to link properly. Nothing had changed. We eventually traced the problem to a corrupted cache, which we fixed by removing it completely.

We would never tolerate this kind of behavior in many other kinds of systems. Contrary to what Hollywood would have you believe, shooting a hole in a car doesn't cause it to explode. Despite the enormous energy contained in a spinning jet engine, a bird flying into it won't cause pieces of turbine to go into low-Earth orbit (though it won't be good for the bird, either). Skyscrapers don't tumble to the ground every time a toilet overflows.

So why can't we build software which will deal gracefully with minor--or even major--failures without crashing? The simple answer is, we can, but we don't.

There are a number of reasons why not, but they basically boil down to (a) money, and (b) the fact that developers can get away without it.

On the money side, despite the frustration of software which crashes, most people will still buy it. This has as much to do with the lack of better alternatives in many cases as with the fact that developing less fragile software takes more money and more discipline. No developer will make that effort unless customers force them to. Never mind the expense of rebuilding the entire infrastructure. After all, having one piece of robust software is useless if the rest of the platform is fragile from the motherboard on up.

And developers can get away with fragile software because a software crash is (usually) much less catastrophic than a dam collapsing or an airplane exploding. The software environment is also much more predictable than the real world. In the real world, stuff happens. Birds fly into jet engines, toilets get plugged up, and so forth. Most software has a pretty good idea of what data will be sent to it, so when something truly unexpected does arrive, it is unlikely to be something the designer anticipated and tested for. Jet engine designers know that air sometimes arrives with hard bird-shaped lumps, but the software designer doesn't always anticipate that the inputs to a module might be malformed in a particular way which sends the software equivalent of red-hot turbine blades hurtling into outer space.
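For illustration only (the record format and field names below are invented, not taken from the incident above), here is a sketch of what treating malformed input as a normal, recoverable event looks like, rather than letting it crash the process:

```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def parse_order(raw: bytes) -> Optional[dict]:
    """Parse one incoming record without trusting the sender.

    Returns the parsed record, or None if the payload is malformed,
    so a single bad input degrades into a logged warning instead of
    an unhandled exception.
    """
    try:
        order = json.loads(raw)
    except (UnicodeDecodeError, ValueError) as exc:
        logger.warning("Discarding malformed payload: %s", exc)
        return None

    # Check the fields this module actually depends on instead of
    # assuming the upstream system always gets them right.
    if not isinstance(order, dict) or "item_id" not in order:
        logger.warning("Payload missing required fields: %r", order)
        return None

    return order
```

The specifics don't matter; the posture does. The module assumes its inputs can be wrong and decides in advance what a graceful response looks like.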

Skyscrapers are built using a series of vertical columns to support the weight of the building. These buildings are designed so that you can cut any one of the columns (or even more than one column) without sending the building crashing to the ground. The people inside might not even notice. What does it take to achieve the same effect in software?

Google does it by assuming that individual servers will fail regularly, and designing a data center which automatically compensates for any dead server. The two key components are encapsulation and redundancy. Every component (hardware and software) has to be encapsulated from other components so that the failure of any one doesn't break something else. And every critical component--meaning components which are necessary for the continued operation of the system as a whole or important parts of it--has to be replicated, so that something else can take up the slack.
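Here is a minimal sketch of the redundancy half of that idea, with a hypothetical list of replica addresses and a do_request() placeholder standing in for a real network call. It shows the general shape of the technique, not Google's actual machinery:

```python
import logging
import random
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical replica addresses; a real system would get these from
# service discovery rather than a hard-coded list.
REPLICAS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

def fetch(key: str, do_request: Callable[[str, str], bytes]) -> bytes:
    """Ask the replicas for `key`, tolerating individual failures.

    `do_request(replica, key)` is a placeholder for the real RPC or
    HTTP call. Any single dead server is absorbed here; callers only
    see an error in the much rarer case that every replica is down.
    """
    replicas = REPLICAS[:]
    random.shuffle(replicas)  # spread load instead of hammering one replica
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return do_request(replica, key)
        except Exception as exc:  # a failed replica is expected, not fatal
            logger.warning("Replica %s failed: %s", replica, exc)
            last_error = exc
    raise RuntimeError("All replicas failed") from last_error
```

Encapsulation is the other half: a misbehaving replica should not be able to corrupt shared state, so its failure stays contained to the call that hit it.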

In other words, assume that anything can and will fail in an unexpected way. Design hardware and software to detect and deal with failure gracefully.

Oh yeah, one more thing. Don't expect Utility Computing to work until you take the fragility out of software.
