A Classic Production Failure

Since I got tied up on a lovely production failure yesterday (hence the Classic), I figured today'd be the perfect day to rehash "//TODO: Uncomment Later", originally published on March 14th, 2007.

What was your worst production failure?

Mine was a missing WHERE clause in a certain DELETE query that resulted in the deletion of all 6,000+ rows instead of just one. Whoops. Of course, my blunder only resulted in several hours of lost work, a painful data restore, and one really embarrassed junior developer (me). I'm sure, had I been in an environment like Adam's, I would have deferred to someone with more expertise or, at the very least, been much, much more careful.
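For what it's worth, a cheap guard against that kind of blunder is to run the DELETE inside a transaction and refuse to commit unless it touched exactly the number of rows you expected. The following is only a minimal sketch in Python with the standard `sqlite3` module, not the system from the story; the `widgets` table and the `guarded_delete` helper are hypothetical names made up for illustration.

```python
import sqlite3


def guarded_delete(conn, sql, params=(), expected_rows=1):
    """Run a DELETE and commit only if it removed exactly `expected_rows` rows.

    Hypothetical helper: with sqlite3's default transaction handling, a
    DELETE implicitly opens a transaction, so it can still be rolled back
    until commit() is called.
    """
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        # Wrong number of rows matched (e.g. a forgotten WHERE clause):
        # undo the delete instead of committing it.
        conn.rollback()
        raise RuntimeError(
            f"expected to delete {expected_rows} row(s), "
            f"statement matched {cur.rowcount}; rolled back"
        )
    conn.commit()
    return cur.rowcount
```

Called as `guarded_delete(conn, "DELETE FROM widgets WHERE id = ?", (42,))`, it behaves like a normal delete. Called with the WHERE clause missing, it raises and leaves the table as it was, assuming the table held more than one row.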

Adam works for a large steel manufacturer. If you haven't been inside a steel mill before, I can tell you firsthand, it's quite an interesting place. The plants themselves easily span several hundred acres, each housing quite a few gigantic, airplane-hangar-sized buildings. Some are for melting iron. Others are used for steel production. And then there's casting. And roughing. And rolling. And, of course, a whole bunch are for support facilities like water filtration and electricity generation. And just about everything is automated.

Of course, as an employee of the steel company, Adam doesn't work directly with the machine-controlling software. Because those are such critical systems, only Extremely Highly Paid Consultants are entrusted with such work. And probably for the better. I, for one, sleep soundly knowing that my software isn't responsible for pouring giant vats of molten metal into lines of rail cars.

Generally speaking, manufacturers and their consultants do a very good job ensuring that their equipment operates safely and at full capacity. A control system (usually a PLC) receives measurements from various sensors -- temperature, vibration, speed, pressure, fuel flow, etc. -- and adjusts the system's parameters when things go slightly awry. In the rare case that something goes seriously wrong, the controller shuts down the equipment by sending a "trip" signal to a separate relay that's connected to all sorts of other shutdown devices (including, of course, The Big Red Button). With all of the redundancy built in, uncontrolled problems virtually never occur.

That said, it was a pretty big deal when one of the steel mill's gas-fired turbines crashed, causing an extensive fire, millions of dollars in lost production, and tens of millions in equipment damage. The good news was that the turbine was shut down before it ripped itself apart and sent 1500+ degree shrapnel in every direction. After an extensive investigation by a team of experts (also Extremely Highly Paid Consultants), it was concluded that the incident was a fluke. No one was to blame.

A few months later, another gas turbine crashed. And then another. The steel manufacturer was starting to get a bit irked at these multi-million dollar "flukes" (not to mention the serious risk to life and limb) and an all-out investigation was launched. Every last detail of the gas turbine and its related equipment was scrutinized. Eventually, they were able to trace the problem to a "watchdog" timer.

Every few seconds or so, the turbine's control system would send a "heartbeat" signal to the watchdog timer in a separate safety system. If the watchdog didn't receive a signal as expected, the safety system would sound an alarm and alert the operator. In these particular cases, the control system software crashed (generally fixed by, what else, a reboot of the control system computers) and the watchdog timer did not alert the safety system.

They went to the consulting company that developed the turbine control system software (and the corresponding watchdog timer). As it turned out, one of the consultants had commented out the entire watchdog timer program. It was a lot easier to debug the control system that way, apparently. The consultant simply forgot to un-comment the code before deploying it to the safety system. Whoops.
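For anyone who hasn't met a watchdog timer, the heartbeat arrangement described above can be sketched in a few lines. This is a toy illustration in Python under assumed names (`Watchdog`, `heartbeat`, `poll`, and the `on_trip` callback are all hypothetical), not the consultants' code, which the story doesn't show and which would have run on dedicated safety hardware rather than in a script.

```python
import time


class Watchdog:
    """Raises an alarm when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, on_trip, clock=time.monotonic):
        self.timeout = timeout      # longest allowed silence, in seconds
        self.on_trip = on_trip      # called once when the silence is too long
        self.clock = clock          # injectable so the timer can be tested
        self.last_beat = clock()
        self.tripped = False

    def heartbeat(self):
        """Called by the control system to say "still alive"."""
        self.last_beat = self.clock()

    def poll(self):
        """Called periodically by the safety system; returns True once tripped."""
        if not self.tripped and self.clock() - self.last_beat > self.timeout:
            self.tripped = True
            self.on_trip()
        return self.tripped
```

The whole arrangement only works if something actually calls `poll()`. Comment that part out and a crashed controller looks exactly like a healthy one, which is the failure the investigators eventually found.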

But wait, there's more! In response to this article, reader G.R.G. (from Insecurity Doors, Mystery of the High Test Scores, Saving A Few Minutes, and so many more) tells of his worst production failure.

Long ago, I worked as a programmer at a university’s hearing research lab. They were awarded a large government grant to study the effects of different kinds of noise on hearing. For the really loud and really faint noises, the researchers used animal subjects with ears that are similar to human ears. Specifically, chinchillas.

The chinchillas would be put into a special chamber for several hours at a time to have their hearing tested. Since the little rodents don’t respond so well to questions like, “which sound is louder?,” a good amount of time had to be spent training them to jump over a little bar in their chamber whenever they heard a beep.

Because a large part of the research project was to study the long-term effects of noise on hearing, the tests would have to be run twenty-four hours a day, seven days a week, for several years. Obviously, it was pretty important that the chinchilla testing be automated. But not very important, though. If it had been very important, they would have had someone other than a grad student write it.

I joined the team about a year into the project and was tasked with rewriting the beep-jump-reward program. It was a ridiculous mess of spaghetti code that seemed to have more GOTO statements than actual code. There were no comments anywhere nor any documentation on what the program’s algorithm was for controlling the beeps and rewards.

After a little while, I was able to figure out the algorithm and rewrite the application. A month or two later, the rewrite was put into production. I documented my work, said my goodbyes, and moved on to my next contract.

A year or so later, the researchers compiled the data and noticed some very surprising results: the chinchillas were a lot more hearing-impaired than they should have been. While this may not seem too big a deal, the findings would have some serious ramifications. Occupational noise-exposure laws would be changed, lawsuits would be filed, and billions would be spent correcting the issue.

Before publishing the results, another team of researchers went over the data and study with a fine-toothed comb to ensure that the results were correct. And whammo, they found a bug in my code. Under certain conditions, one part of the application did not correctly check that the chinchilla jumped at the right time. This meant that the program would deny the chinchilla a food pellet, giving it negative feedback when it in fact did the right thing. This led to some rather confused chinchillas that had no idea when they were actually supposed to jump.

In the end, over a year’s worth of data was thrown out, a few man-years of work was wasted, and there were a whole lot of cute little rodents that were rather confused and hard-of-hearing. I still feel bad for deafening those poor chinchillas...
