| « 2.9: Just Say It | tblTimesheet » |
Tandem Computers were all the rage in the mid-1980s, especially in the banking industry and other high-transaction environments. "While most computer systems have failure rates on the order of a few days," Tandem salespeople would often say, "our NonStop line of computers is designed to fail hundreds of times less. We measure our uptime in years." And they were right: Tandem delivered hardware solutions with virtually no downtime.
Tandem's engineers accomplished this through meticulous attention to detail: there was no single point of failure. Each NonStop server had at least two CPUs. Each CPU had its own memory, its own I/O bus and two connections (in case one failed) to an equally redundant multi-CPU bus that had its own redundant shared memory. Even the OS was specially crafted to allow in-memory changes without the need for a reboot. And naturally, the whole server was powered by two redundant power supplies, just in case. Sure, it was expensive, but it worked: Tandem managed to exceed the coveted Five Nines (99.999 percent) of uptime.
When Chris B. first learned of all this, he was more intimidated than impressed. At his new company, a small consulting firm that developed minicomputer systems and software, he'd be the backup to the backup of the primary Tandem Computer guru.
"It's nothing to worry about," Chris's boss told him. "Those things hardly ever have issues. And in the highly unlikely scenario that you'll personally have to deal with one, they've got incredible support to walk you through anything."
That assuaged Chris a bit, right up until the point that the impossible happened. On the one day that the Tandem Computer guru called in sick, his primary backup had to go off-site to deal with some emergency and his secondary backup, Chris's boss, was overseas. One of the company's larger clients, the biggest bank in South Africa, called up with an issue: its central ATM network server had triggered a "Failed CPU" alarm.
Fortunately, the server running the bank's ATM network was a NonStop computer, which reduced the "Failed CPU" error from a mission-critical failure to an annoying red blip on its monitoring software. South Africa's bank machines still doled out cash on command, but were one CPU failure away from a complete system meltdown. This meant that Chris would need to go to the bank's central offices and fix the problem himself.
Before making the long drive out, Chris stopped by the company's warehouse to pick up the replacement CPU unit. It was a briefcase-sized enclosure that had all sorts of circuit boards and weighed a solid 35 pounds. He also picked up the large maintenance manual to help guide him through the task.
After arriving on site, Chris checked the system out. Per the maintenance guide's simple instructions, he verified that CPU Unit 0 had, in fact, failed and needed to be swapped out. To do that, all he'd need to do was flip the switch on Power Supply Unit (PSU) 0, pull out the CPU unit, slide the new one in and flip back the switch on the PSU.
"Simple enough," Chris shrugged, and began the procedure.
Before the "click" of the switch even hit his ears, Chris had a stark realization: He had inadvertently switched off PSU 1, bringing the total number of operational CPU units to zero. Though he immediately turned the PSU back on, the damage was already done: The NonStop server had been stopped and most of South Africa's ATMs would not function until it was fully started again.
Normally, this would mean that ATM customers would see an "out of service" message for a good five minutes while the NonStop rebooted. However, this was the first time in three-and-a-half years that the computer had been rebooted.
Since the last reboot, the bank's developers and IT staff had applied several upgrades and changes to the system and ATM software. Occasionally, they'd only apply the changes to the in-memory program (using that neat feature of the Tandem OS) and neglect to add the changes to the boot script. Other times, they'd make a typo, perhaps a misplaced comma or semicolon, when updating the boot script.
While the NonStop server quickly started back up, its mission-critical applications crashed and burned. It took nearly 24 long hours of collaboration between the bank's IT staff, Chris and Chris's colleagues before the ATM server was fully operational again.
Fortunately, Chris didn't take the heat for the massive amount of downtime. The bank certainly learned its lesson and instituted all sorts of policies to prevent such an occurrence from happening again. Still, no one appreciated the irony that a system so painstakingly designed for uptime had become so downtime-prone.
Designed For Reliability was originally published in Alex's DevDisasters column in the May 15, 2008 issue of Redmond Developer News. RDN is a free magazine for influential readers and provides insight into Microsoft's plans, and news on the latest happenings and products in the Windows development marketplace.
No, the WTF is that a system purchased because of its uptime ability had been abused, so much so that the one time it did go down (accidentally) it couldn't come back up again. If the other engineers had fixed up the boot scripts and carefully checked and tested their changes, the mainframe should have just come straight back up again.
...
A friend of mine used to manage Tandem machines; he's told me the story several times about how his company purchased another company. Some ten years later it came around to shutting down the data centre of the company that had been bought out. They found a Tandem mainframe stuck in an old disused room; they had no idea what it did or why it was there, and anyone who would have known during the merger had long since gone... It had become part of the furniture. So someone had the bright idea of just shutting it down...
Several hours later the Port of Dover was backed up, as a small but very vital part of the Customs & Excise system had mysteriously gone missing. All hell had broken loose at his company, as they had been fingered as the people who now ran the service… though no one seemed to be aware that they did, let alone where the machine was. That Tandem had been sitting there for some ten years crunching data and spitting results back out again…
I want to file it under 'urban legend', but I always quite liked the story.
and it's ability to shove things in
Addendum (2008-07-22 10:47): **and it's ability to shove things in -- Ooops I was writing an email while writing this.. must have got the wrong window at some point..
Back in the '90s, there was a bomb in the World Trade Center. Apparently there were a whole bunch of Tandem machines in the basement -- and they were all blown over by the force of the blast. But they kept on working. They are very solid machines indeed.
By the way, I'm 'Chris B' :) ... and yes, I felt very foolish when I flipped the wrong switch. I'd only been at the company for a month or so!
Re: Designed For Reliability
2008-07-22 11:00 • by Yoooder (unregistered)
Didn't TFA mention him already being the redundant-redundant-redundant person?
The guru had his #1 (redundant) and #2 (redundant's redundancy) people both unavailable, so it fell on the redundant-redundancy's redundant machine maintainer to assure redundancy of the redundant system was restored and redundantly available? Ouch.
Re: Designed For Reliability
2008-07-22 11:16 • by FlyboyFred (unregistered)
Let's clean up the language. It should be "buttuaged."
I work in the investment industry and I can assure you that Tandem is still "all the rage". These things power the financial markets; they're at the heart of every major stock and commodity exchange in the world.
I'm sure there are a few Microsoft customers happy to read this story though - "see? good thing we have to reboot every week - otherwise we'd never get to test our startup scripts"