• (cs)

    When I read "No single point of failure", I knew something was wrong...

  • jtl (unregistered)

    His kingdom for an 'Are you sure?' dialog.

  • B-Rad (unregistered)

    I love the irony of the redundant backup support person taking down the redundant CPU.

  • sewiv (unregistered)

    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?

  • sir_flexalot (unregistered)

    I guess now they have molly guards labeled "1" and "0" for people who don't know the difference and would inadvertently turn off 1 instead of 0...

  • Martin (unregistered)

    Reminds me of a problem we had in our server room a few years back.

    Server had redundant PSUs, one of them failed. Monitoring told me, I ordered the backup FRU#, it arrived later the same day.

    I went to install it, stood behind the server. There were two PSUs. One with a single green LED (Unit 1), another one with a green LED and a blinking Red LED (Unit 2).

    I removed Unit 2, and the server in front of me suddenly got very quiet. I reseated Unit 2, replaced Unit 1, and both were back up - with a lit green LED and a blinking Red LED.

    That taught me to always have a very, very close look at the maintenance manual. Not everything is intuitive.

  • Steve (unregistered)

    Reminds me of when my job was to install software changes to paint robots in-between shifts at auto plants...

    You had to put your faith in the last guy that was there to update the server with the right code. I couldn't watch when the first ghost job was run...

  • (cs)

    "Mission-critical"?

    What do they think this is, NASA?

  • Larry Yates (unregistered)

    HP NonStop (Tandem) computers are STILL the most reliable machines in the marketplace for mission critical enterprises.

  • TrollScore (unregistered)

    The real WTF is that Chris, an obvious slaptard, wasn't guillotined for blatant incompetence. Sure, the in-memory committed changes are ignorant; I've seen the same thing with CCIEs and their startup-config/running-config antics, but there's no reason Chris shouldn't have been executed for crimes against humanity.

  • (cs) in reply to sewiv
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?

    No, the WTF is that a system purchased because of its uptime ability had been abused so much that the one time it did go down (accidentally) it couldn't come back up again.

    If the other engineers had fixed up the boot scripts and carefully checked and tested their changes the mainframe should have just come straight back up again.

    ...

    A friend of mine used to manage Tandem machines; he's told me the story several times about how his company purchased another company. Some ten years later it came around to shutting down the data centre of the company that had been bought out. They found a Tandem mainframe stuck in an old disused room; they had no idea what it did or why it was there, and anyone who would have known during the merger had long since gone... It had become part of the furniture. So someone had the bright idea of just shutting it down...

    Several hours later the Port of Dover was backed up, as a small but very vital part of the Customs & Excise system had mysteriously gone missing. All Hell had broken loose at his company, as they had been fingered as the people who now ran the service… though no one seemed to be aware that they did, let alone where the machine was.

    That Tandem had been sitting there for some ten years crunching data and spitting results back out again…

    I want to file it under ‘urban legend’, but I always quite liked the story.

    and it's ability to shove things in

    Addendum (2008-07-22 10:47): **and it's ability to shove things in --

    Ooops I was writing an email while writing this.. must have got the wrong window at some point..

  • (cs)
    Alex:
    no one appreciated the irony that a system so painstakingly designed for uptime had become so downtime-prone

    Downtime-prone? It was up non-stop for three years, and the main reason it went down for 24 hours was human error? Sounds like the people were downtime-prone.

  • Paul W. Homer (unregistered)

    This just goes to show that purchasing the fancy hardware all by itself doesn't really help you if you don't know how to use it properly. What is needed is a real understanding of operations, and a consistent reasonable process for running and upgrading the system. Sloppy band-aid practices based on weak assumptions generally fail at some point.

    If the bank had been really wise, they would have had their own Tandem expert on-site as well, just in case the consultant showing up was a bit unprepared. Two minds are always better than one :-)

    Paul. http://theprogrammersparadox.blogspot.com

  • anoncow (unregistered)

    Back in the 90's, there was a bomb in the World Trade Centre. Apparently there were a whole bunch of Tandem machines in the basement -- and they all were blown over by the force of the blast. But they kept on working. They are very solid machines indeed.

    By the way, I'm 'Chris B' :) ... and yes, I felt very foolish when I flipped the wrong switch. I'd only been at the company for a month or so!

  • Alonzo Turing (unregistered) in reply to Paul W. Homer

    It would have helped to switch off the right PSU, too.

  • RJ (unregistered)

    I worked for "a large fast food corporation" that used a Tandem to process data for its company-owned restaurants (never call them stores...).

    As the article said, it was an excellent machine that would continue processing even if there was a problem with some of its hardware. The only problem was that the software had to be written with 'checkpoints' - places where the OS would save information so that if a CPU or memory module went bad it could resume processing from that point. Other hardware issues (Power supply, hard drive, etc) didn't need checkpoints.

    Of course it had a totally proprietary operating system (named Guardian) and a proprietary programming language (TACL) in addition to relatively standard COBOL and C compilers.

    The main issue with this generation of Tandem computers was that it wasn't Unix - they did introduce Unix systems, but IIRC they weren't nearly as popular. The Tandem corporation was bought by Compaq in 1997 - a year before Compaq bought Digital Equipment and 4 years before Compaq 'merged' with HP. (see the Compaq entry at wikipedia for the timetable)
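
    For anyone who hasn't written checkpointed code, here is a minimal sketch of the general idea in Python. It is not how Guardian did it: the JSON file, the function name process_with_checkpoints and the fail_at parameter are all made up for illustration. It only shows the "save your place so a restart can resume from that point" pattern.

```python
import json
import os


def process_with_checkpoints(records, checkpoint_path, fail_at=None):
    """Sum the records, saving progress after each one so a restart can resume.

    fail_at is a hypothetical knob that simulates a hardware failure
    just before the record at that index is processed.
    """
    state = {"next_index": 0, "total": 0}

    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated CPU failure")

        state["total"] += records[i]
        state["next_index"] = i + 1

        # Checkpoint: write to a temp file, then rename it into place,
        # so a crash never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

    Run it once with fail_at set and it dies partway through; run it again against the same checkpoint file and it carries on from the last saved record instead of starting over.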

  • (cs)
    That assuaged Chris a bit
    He would probably have preferred that it assuage his fears. </pedantry>
  • Pez (unregistered) in reply to sewiv
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?

    Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.

  • biziclop (unregistered) in reply to B-Rad
    B-Rad:
    I love the irony of the redundant backup support person taking down the redundant CPU.

    TRWTF would've been if the redundant backup support person taking down the redundant CPU was made redundant after this.

  • Brady Kelly (proudly in Jo'burg) (unregistered) in reply to anoncow
    anoncow:
    Back in the 90's, there was a bomb in the World Trade Centre. Apparently there were a whole bunch of Tandem machines in the basement -- and they all were blown over by the force of the blast. But they kept on working. They are very solid machines indeed.

    By the way, I'm 'Chris B' :) ... and yes, I felt very foolish when I flipped the wrong switch. I'd only been at the company for a month or so!

    Better than the wrong where clause.
  • (cs) in reply to Brady Kelly (proudly in Jo'burg)
    Brady Kelly (proudly in Jo'burg):
    Better than the wrong where clause.

    ...or no where clause at all.

  • (cs) in reply to Pez
    Pez:
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?
    Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.
    I'd sack him, too. He was hired to maintain that computer. It wasn't "a simple mistake". It took an impressive amount of disregard for caution and double-checking to make that mistake. What would have been "simple" is getting it right.
  • Yoooder (unregistered) in reply to biziclop

    Didn't TFA mention him already being the redundant-redundant-redundant person?

    The guru had his #1 (redundant) and #2 (redundant's redundancy) people both unavailable, so it fell on the redundant-redundancy's redundant machine maintainer to assure redundancy of the redundant system was restored and redundantly available?

    Ouch.

  • (cs) in reply to FredSaw
    FredSaw:
    That assuaged Chris a bit
    He would probably have preferred that it assuage his fears. </pedantry>
    <pedantry> You missed the open pedantry tag. </pedantry>
  • (cs) in reply to Yoooder
    Yoooder:
    Didn't TFA mention him already being the redundant-redundant-redundant person?

    The guru had his #1 (redundant) and #2 (redundant's redundancy) people both unavailable, so it fell on the redundant-redundancy's redundant machine maintainer to assure redundancy of the redundant system was restored and redundantly available?

    Ouch.

    An observation from the DURRD (Department of Unnecessarily Repetitious Redundancy Department).

  • (cs)

    I've done a similar thing with RAID drives... Somehow the order of the drives wasn't what it was supposed to be or something... but when you pull one too many drives from a RAID array, it's not too fun.

    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"

  • E (unregistered)

    This reminds me of an outage we had. It was a large-scale trading system. The servers had redundant power supplies. The admin who installed them plugged them both into the same power distribution module (basically a fancy power strip) in the rack. The module failed. We lost all of the hosts plugged into it. That was around his third major blunder like that. We let him go.

  • KD (unregistered)

    Clearly this WTF is about over-reliance on perceived quality (as in "of course the Titanic doesn't need lifeboats for everyone - it's unsinkable!"). However, I think this story also highlights an unwritten rule in the IT profession: no system, however painstakingly designed, can withstand the destructive force of an idiot who's found the power switch.

  • (cs)

    Pfft. If you want reliability go for a TPF system.

  • shame (unregistered) in reply to anoncow

    Sometimes consequences are more severe - cases of shutting down the wrong engine: http://aviation-safety.net/database/dblist.php?Event=ACEW

  • (cs) in reply to notromda
    notromda:
    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"

    I'd like to see you design and implement the Universal Hardware Fault Indicator (UHFI) spec. I bet it will be easy!

  • FlyboyFred (unregistered) in reply to FredSaw
    FredSaw:
    That assuaged Chris a bit
    He would probably have preferred that it assuage his fears. </pedantry>

    Let's clean up the language. It should be "buttuaged."

  • Walleye (unregistered) in reply to sir_flexalot
    sir_flexalot:
    I guess now they have molly guards labeled "1" and "0" for people who don't know the difference and would inadvertently turn off 1 instead of 0...

    Actually, they're labelled "0", "1" and "File not found"

  • Morasique (unregistered) in reply to Pez

    A guy that's worked for the company a month goes out on his first repair job and takes down all the ATMs for an entire country. I probably would've fired him too; at the very least you'd think the bank would be more annoyed

  • anon (unregistered) in reply to AccessGuru
    AccessGuru:
    notromda:
    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"

    I'd like to see you design and implement the Universal Hardware Fault Indicator (UHFI) spec. I bet it will be easy!

    Actually, most new server hardware (from the big boys like HP and IBM) already has that... The newer HP servers have LEDs for failed power supplies, fans, hard drives, memory sticks, PCI bus, etc.

  • Morasique (unregistered) in reply to Markp

    What's with the sudden Internet trend of making fun of the phrase "mission-critical"? Do people really not know what it means?

  • Jason (unregistered)

    Tandem used to offer these great coffee mugs as swag... they had two handles, one on each side.

  • Andrew (unregistered) in reply to RJ
    RJ:
    I worked for "a large fast food corporation" that used a Tandem to process data for its company-owned restaurants (never call them stores...).

    As the article said, it was an excellent machine that would continue processing even if there was a problem with some of its hardware. The only problem was that the software had to be written with 'checkpoints' - places where the OS would save information so that if a CPU or memory module went bad it could resume processing from that point. Other hardware issues (Power supply, hard drive, etc) didn't need checkpoints.

    Of course it had a totally proprietary operating system (named Guardian) and a proprietary programming language (TACL) in addition to relatively standard COBOL and C compilers.

    The main issue with this generation of Tandem computers was that it wasn't Unix - they did introduce Unix systems, but IIRC they weren't nearly as popular. The Tandem corporation was bought by Compaq in 1997 - a year before Compaq bought Digital Equipment and 4 years before Compaq 'merged' with HP. (see the Compaq entry at wikipedia for the timetable)

    This makes me think of the parallel programming articles today. They tell us how we'll all need to re-learn programming, and think in parallel. This article reminds us that special multi-processor programming is nothing new.

  • anoncow (unregistered) in reply to Morasique

    I was hired as a software engineer, not a hardware guy. Only desperate circumstances forced the company to send me out on a hardware call.

  • GregW (unregistered)

    "Tandem Computers were all the rage in the mid-1980s, ..."

    And the mid-1990s, and the mid-2000s, too.

    "Jumentum?" There are so many inappropriate jokes available for that captcha.

  • Andrew (unregistered) in reply to notromda
    notromda:
    I've done a similar thing with RAID drives... Somehow the order of the drives wasn't what it was supposed to be or something... but when you pull one too many drives from a RAID array, it's not too fun.

    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"

    Something caused the motherboard to fail. It probably also damaged the power on the unit. Do you want to rely on a defective LED light on that motherboard to show what to remove?

    I'd rather have an LED go off to indicate a bad motherboard. I'm not an EE, and can't be sure this is the best way either.

  • (cs) in reply to Andrew
    Andrew:
    This makes me think of the parallel programming articles today. They tell us how we'll all need to re-learn programming, and think in parallel. This article reminds us that special multi-processor programming is nothing new.
    Of course it's not new. What's new is that more and more programmers are having to think that way. (Alas, it seems that the number of people who can't handle it is increasing too.)

    And just occasionally someone figures out how to parallelize something new. That's when the field genuinely advances. It doesn't happen that often though.

  • Steve (unregistered)

    Back in the day, I lusted after Tandem systems mightily. We could have used them instead of our IBM Big Iron.

    IMHO, TRWTF is the fact that we seem to have in most cases retrogressed from the kind of reliability that Tandem and some other similar vendors supplied to the largely sloppy and slipshod environments that we often see described in these pages.

  • Steve (unregistered) in reply to GregW
    GregW:
    "Jumentum?" There are so many inappropriate jokes available for that captcha.
    Isn't that what Joseph Lieberman claimed to have in 2004? Just before his campaign spun in?

    Oh, I guess that was "Joe-mentum".

    Never mind.

  • (cs) in reply to notromda
    notromda:
    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"
    Nearly all do. Place the blame on whoever specced out that kind of server. I've never heard of a hotswap unit that didn't have a failure indicator.
  • Ben4jammin (unregistered) in reply to FredSaw
    FredSaw:
    Pez:
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?
    Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.
    I'd sack him, too. He was hired to maintain that computer. It wasn't "a simple mistake". It took an impressive amount of disregard for caution and double-checking to make that mistake. What would have been "simple" is getting it right.

    While I agree that it wasn't a "simple" mistake, I would hire this guy in a second. Why? Someone who has been through such an "oh dear God" moment usually double checks stuff the rest of their life. I'm a network admin and everyone in our IT dept has had a "dear God" moment. After that, you ALWAYS back up/RTFM/double check or whatever the situation calls for. That said, the first time is a learning experience... the second is an RGE (resume generating event).

  • Calvin Spealman (unregistered)

    The WTF here is that the two CPU and power units were behind the same door. Solution: two doors with a one-open-at-a-time mechanism.

  • (cs) in reply to FredSaw
    FredSaw:
    Pez:
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?
    Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.
    I'd sack him, too. He was hired to maintain that computer. It wasn't "a simple mistake". It took an impressive amount of disregard for caution and double-checking to make that mistake. What would have been "simple" is getting it right.

    This board is not a fan of reading comprehension, is it?

    From the article

    When Chris B. first learned of all this, he was more intimidated than impressed. At his new company - a small consulting firm that developed minicomputer systems and software - he'd be the backup to the backup of the primary Tandem Computer guru.

    How many small consulting firms do you know that hire backups to backups as their sole job? Isn't it highly likely that he was hired to do something else (write software perhaps) and was the de facto backup to the backup?

    Yeah, let's fire that software guy who was hired to write software, who botched a hardware issue that WE thrust upon him.

    /I probably wouldn't fire the main guru if it was the first mistake of this kind

  • (cs)

    Seeing this story reminded me of a discussion I saw on the Risks List about aircraft safety systems.

    A while ago there was a thread about the redundant safety systems that kept planes flying when various elements went kaput, by adapting to and circumventing faults.

    But the main issue was that once these systems have run out of wiggle room, and if the support measures are always behind the scenes, then the next failure will most likely be catastrophic.

    Thus your backup systems need to keep people very aware when protection measures have kicked in.

  • (cs) in reply to taylonr
    taylonr:
    This board is not a fan of reading comprehension, is it?

    From the article

    When Chris B. first learned of all this, he was more intimidated than impressed. At his new company - a small consulting firm that developed minicomputer systems and software - he'd be the backup to the backup of the primary Tandem Computer guru.
    How many small consulting firms do you know that hire backups to backups as their sole job? Isn't it highly likely that he was hired to do something else (write software perhaps) and was the de facto backup to the backup?

    Yeah, let's fire that software guy who was hired to write software, who botched a hardware issue that WE thrust upon him.

    /I probably wouldn't fire the main guru if it was the first mistake of this kind

    From the article: "Chris B. was hired to write software. In addition, if there ever was a time when the two server gurus were not available, Chris might be called on to help out with any problems there."

    Oh... wait... that isn't what the article said, is it? Reading comprehension, indeed.

Leave a comment on “Designed For Reliability”