What was your worst production failure?

Mine was a missing WHERE clause in a certain DELETE query that resulted in the deletion of all 6,000+ rows instead of just one. Whoops. Of course, my blunder only resulted in several hours of lost work, a painful data restore, and one really embarrassed junior developer (me). I'm sure, had I been in an environment like Adam's, I would have deferred to someone with more expertise or, at the very least, been much, much more careful.

Adam works for a large steel manufacturer. If you haven't been inside a steel mill before, I can tell you firsthand, it's quite an interesting place. The plants themselves easily span several hundred acres, each housing quite a few gigantic, airplane-hangar-sized buildings. Some are for melting iron. Others are used for steel production. And then there's casting. And roughing. And rolling. And, of course, a whole bunch are for support facilities like water filtration and electricity generation. And just about everything is automated.

Of course, as an employee of the steel company, Adam doesn't work directly with the machine-control software. Because those are such critical systems, only Extremely Highly Paid Consultants are entrusted with such work. And probably for the better. I, for one, sleep soundly knowing that my software isn't responsible for pouring giant vats of molten metal into lines of rail cars.

Generally speaking, manufacturers and their consultants do a very good job ensuring that their equipment operates safely and at full capacity. A control system (usually a PLC) receives measurements from various sensors -- temperature, vibration, speed, pressure, fuel flow, etc. -- and adjusts the system's parameters when things go slightly awry. In the rare case that something goes seriously wrong, the controller shuts down the equipment by sending a "trip" signal to a separate relay that's connected to all sorts of other shutdown devices (including, of course, The Big Red Button). With all of the redundancy built in, uncontrolled problems virtually never occur.
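For the curious, here is a rough sketch of that adjust-or-trip decision in Python. It is purely illustrative: a real PLC would be programmed in something like ladder logic, and the names (`decide`, `scan`, `Action`) and every number below are made up for the example, not taken from Adam's plant.

```python
from enum import Enum


class Action(Enum):
    HOLD = "hold"      # reading is within tolerance, leave things alone
    ADJUST = "adjust"  # slightly awry, nudge the system's parameters
    TRIP = "trip"      # seriously wrong, signal the shutdown relay


def decide(reading, setpoint, tolerance, trip_limit):
    """Classify one sensor reading (hypothetical thresholds)."""
    deviation = abs(reading - setpoint)
    if deviation >= trip_limit:
        return Action.TRIP
    if deviation > tolerance:
        return Action.ADJUST
    return Action.HOLD


def scan(readings, limits):
    """Evaluate every sensor once.

    readings: {sensor name: value}
    limits:   {sensor name: (setpoint, tolerance, trip_limit)}
    Returns the per-sensor actions and whether a trip signal is needed.
    """
    actions = {name: decide(value, *limits[name])
               for name, value in readings.items()}
    trip = any(action is Action.TRIP for action in actions.values())
    return actions, trip
```

A single sensor reporting a reading past its trip limit is enough to set the trip flag, which is the part that would go out to the relay.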

That said, it was a pretty big deal when one of the steel mill's gas-fired turbines crashed, causing an extensive fire, millions of dollars in lost production, and tens of millions in equipment damage. The good news was that the turbine was shut down before it ripped itself apart and sent 1500+ degree shrapnel in every direction. After an extensive investigation by a team of experts (also Extremely Highly Paid Consultants), it was concluded that the incident was a fluke. No one was to blame.

A few months later, another gas turbine crashed. And then another. The steel manufacturer was starting to get a bit irked at these multimillion-dollar "flukes" (not to mention the serious risk to life and limb), and an all-out investigation was launched. Every last detail of the gas turbine and its related equipment was scrutinized. Eventually, they were able to trace the problem to a "watchdog" timer.

Every few seconds, the turbine's control system would send a "heartbeat" signal to the watchdog timer in a separate safety system. If the watchdog didn't receive a signal as expected, the safety system would sound an alarm and alert the operator. In these particular cases, the control system software had crashed (generally fixed by, what else, a reboot of the control system computers), yet the watchdog timer never alerted the safety system.
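If you've never run into a watchdog timer, the idea fits in a few lines. What follows is a minimal Python sketch, not the consultants' actual code: the `Watchdog` class, the `check` function, and the timeout value are all hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
import time


class Watchdog:
    """Remembers the last heartbeat and reports when the next one is overdue."""

    def __init__(self, timeout_s, now=None):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic() if now is None else now

    def heartbeat(self, now=None):
        # Called by the control system every few seconds while it is healthy.
        self.last_beat = time.monotonic() if now is None else now

    def expired(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) > self.timeout_s


def check(watchdog, raise_alarm, now=None):
    """Run periodically by the safety system; alarms if the heartbeat stopped."""
    if watchdog.expired(now):
        raise_alarm("control system heartbeat missed")
        return True
    return False
```

With a five-second timeout, a heartbeat at t=4 keeps `check` quiet at t=8 but raises the alarm at t=9.5. If the periodic `check` call never runs, a crashed control system looks exactly like a healthy one.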

They went to the consulting company that developed the turbine control system software (and the corresponding watchdog timer). As it turned out, one of the consultants had commented out the entire watchdog timer program. It was a lot easier to debug the control system that way, apparently. The consultant simply forgot to un-comment the code before deploying it to the safety system. Whoops.


On a lighter note, Eric (my company's intern) just told me that all of the stickers from Free Sticker Week have been mailed. They're still available for free (details here). Eric did have one small request, though: "And please, make sure you send a stamped, self-addressed envelope with your request. If I have to label/seal/address one more @#!%*&@ envelope, god knows what I will do. Now, if you’ll all excuse me, I have a triple venti mocha backflip latte to fetch." And Eric, no whipped cream on the latte, please!
