Tandem Computers were all the rage in the mid-1980s, especially in the banking industry and other high-transaction environments. "While most computer systems go only a few days between failures," Tandem salespeople would often say, "our NonStop line of computers is designed to fail hundreds of times less often. We measure our uptime in years." And they were right: Tandem delivered hardware solutions with virtually no downtime.
This was accomplished through the meticulous attention to detail of Tandem's engineers: There was no single point of failure. Each NonStop server had at least two CPUs. Each one had its own memory, its own I/O bus and two connections (in case one failed) to an equally redundant multi-CPU bus that had its own redundant shared memory. Even the OS was specially crafted to allow in-memory changes without the need for a reboot. And naturally, the whole server was powered by two redundant power supplies, just in case. Sure, it was expensive, but it worked: Tandem managed to exceed the coveted Five Nines (99.999 percent) of uptime.
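To put Five Nines in perspective, here is a rough back-of-the-envelope sketch of how much downtime per year a given uptime percentage leaves room for. The function name and the use of an average 365.25-day year are assumptions made for illustration; they do not come from Tandem's own figures.

```python
# Average year length in minutes, counting leap days (an assumption for this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime target permits."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Three, four and five nines, for comparison.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime leaves about {minutes:.1f} minutes of downtime per year")
```

By this estimate, Five Nines leaves only a little over five minutes of downtime in an entire year.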
When Chris B. first learned of all this, he was more intimidated than impressed. At his new company, a small consulting firm that developed minicomputer systems and software, he'd be the backup to the backup of the primary Tandem Computer guru.
"It's nothing to worry about," Chris's boss told him. "Those things hardly ever have issues. And in the highly unlikely scenario that you'll personally have to deal with one, they've got incredible support to walk you through anything."
CPU Failure
That assuaged Chris's fears a bit, right up until the impossible happened. On the one day that the Tandem Computer guru called in sick, his primary backup had to go off-site to deal with some emergency and his secondary backup, Chris's boss, was overseas. One of the company's larger clients, the biggest bank in South Africa, called up with an issue. Their central ATM network server had triggered a "Failed CPU" alarm.
Fortunately, the bank's server running its ATM network was a NonStop computer, which reduced the "Failed CPU" error from a mission-critical failure to an annoying red blip on its monitoring software. South Africa's bank machines still doled out cash on command, but were one CPU failure away from a complete system meltdown. This meant that Chris would need to go to the bank's central offices and fix the problem himself.
Before making the long drive out, Chris stopped by the company's warehouse to pick up the replacement CPU unit. It was a briefcase-sized enclosure that had all sorts of circuit boards and weighed a solid 35 pounds. He also picked up the large maintenance manual to help guide him through the task.
After arriving on site, Chris checked the system out. Per the maintenance guide's simple instructions, he verified that CPU Unit 0 had, in fact, failed and needed to be swapped out. To do that, all he'd need to do was flip the switch on Power Supply Unit (PSU) 0, pull out the CPU unit, slide the new one in and flip back the switch on the PSU.
"Simple enough," Chris shrugged, and began the procedure.
Holdup at the ATM
Before the "click" of the switch even hit his ears, Chris had a stark realization: He had inadvertently switched off PSU 1, bringing the total number of operational CPU units to zero. Though he immediately turned the PSU back on, the damage was already done: The NonStop server had been stopped and most of South Africa's ATMs would not function until it was fully started again.
Normally, this would mean that ATM customers would see an "out of service" message for a good five minutes while the NonStop rebooted. However, this was the first time in three-and-a-half years that the computer had been rebooted.
Since the last reboot, the bank's developers and IT staff had applied several upgrades and changes to the system and ATM software. Occasionally, they'd only apply the changes to the in-memory program (using that neat feature of the Tandem OS) and neglect to add the changes to the boot script. Other times, they'd make a typo, perhaps a misplaced comma or semicolon, when updating the boot script.
While the NonStop server quickly started back up, its mission-critical applications crashed and burned. It took nearly 24 long hours of collaboration between the bank's IT staff, Chris and Chris's colleagues before the ATM server was fully operational again.
Fortunately, Chris didn't take the heat for the massive amount of downtime. The bank certainly learned its lesson and instituted all sorts of policies to prevent such an occurrence from happening again. Still, no one appreciated the irony that a system so painstakingly designed for uptime had become so downtime-prone.
Designed For Reliability was originally published in Alex's DevDisasters column in the May 15, 2008 issue of Redmond Developer News. RDN is a free magazine for influential readers and provides insight into Microsoft's plans, and news on the latest happenings and products in the Windows development marketplace.