Admin
When I read "No single point of failure", I knew something was wrong...
Admin
His kingdom for an 'Are you sure?' dialog.
Admin
I love the irony of the redundant backup support person taking down the redundant CPU.
Admin
Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?
Admin
I guess now they have molly guards labeled "1" and "0" for people who don't know the difference and would inadvertently turn off 1 instead of 0...
Admin
Reminds me of a problem we had in our server room a few years back.
Server had redundant PSUs, one of them failed. Monitoring told me, I ordered the backup FRU#, it arrived later the same day.
I went to install it, stood behind the server. There were two PSUs. One with a single green LED (Unit 1), another one with a green LED and a blinking Red LED (Unit 2).
I removed Unit 2, and the server in front of me suddenly got very quiet. I reseated Unit 2, replaced Unit 1, and both were back up - with a lit green LED and a blinking Red LED.
That taught me to always have a very, very close look at the maintenance manual. Not everything is intuitive.
Admin
Reminds me of when my job was to install software changes to paint robots in-between shifts at auto plants...
You had to put your faith in the last guy that was there to update the server with the right code. I couldn't watch when the first ghost job was run...
Admin
"Mission-critical"?
What do they think this is, NASA?
Admin
HP NonStop (Tandem) computers are STILL the most reliable machines in the marketplace for mission critical enterprises.
Admin
The real WTF is that Chris, an obvious slaptard, wasn't guillotined for blatant incompetence. Sure, the in-memory committed changes are ignorant; I've seen the same thing with CCIEs and their startup-config/running-config antics, but there's no reason Chris shouldn't have been executed for crimes against humanity.
Admin
No, the WTF is that a system purchased because of its uptime ability had been abused, so much so that the one time it did go down (accidentally) it couldn't come back up again.
If the other engineers had fixed up the boot scripts and carefully checked and tested their changes, the mainframe should have just come straight back up again.
...
A friend of mine used to manage Tandem machines; he's told me the story several times about how his company purchased another company. Some ten years later it came around to shutting down the data centre of the company that had been bought out. They found a Tandem mainframe stuck in an old disused room. They had no idea what it did or why it was there, and anyone who would have known during the merger had long since gone... It had become part of the furniture. So someone had the bright idea of just shutting it down...
Several hours later the Port of Dover was backed up, as a small but very vital part of the Customs & Excise system had mysteriously gone missing. All Hell had broken loose at his company, as they had been fingered as the people who now ran the service… though no one seemed to be aware that they did, let alone where the machine was.
That Tandem had been sitting there for some ten years crunching data and spitting results back out again…
I want to file it under ‘urban legend’, but I always quite liked the story.
and it's ability to shove things in
Addendum (2008-07-22 10:47): **and it's ability to shove things in --
Ooops I was writing an email while writing this.. must have got the wrong window at some point..
Admin
Downtime-prone? It was up non-stop for three years, and the main reason it went down for 24 hours was human error? Sounds like the people were downtime-prone.
Admin
This just goes to show that purchasing the fancy hardware all by itself doesn't really help you if you don't know how to use it properly. What is needed is a real understanding of operations, and a consistent reasonable process for running and upgrading the system. Sloppy band-aid practices based on weak assumptions generally fail at some point.
If the bank had been really wise, they would have had their own Tandem expert on-site as well, just in case the consultant showing up was a bit unprepared. Two minds are always better than one :-)
Paul. http://theprogrammersparadox.blogspot.com
Admin
Back in the 90's, there was a bomb in the world trade centre. Apparently there were a whole bunch of Tandem machines in the basement -- and they all were blown over by the force of the blast. But they kept on working. They are very solid machines indeed.
By the way, I'm 'Chris B' :) ... and yes, I felt very foolish when I flipped the wrong switch. I'd only been at the company for a month or so!
Admin
It would have helped to switch off the right PSU, too.
Admin
I worked for "a large fast food corporation" that used a Tandem to process data for its company-owned restaurants (never call them stores...).
As the article said, it was an excellent machine that would continue processing even if there was a problem with some of its hardware. The only problem was that the software had to be written with 'checkpoints' - places where the OS would save information so that if a CPU or memory module went bad it could resume processing from that point. Other hardware issues (Power supply, hard drive, etc) didn't need checkpoints.
Of course it had a totally proprietary operating system (named Guardian) and a proprietary programming language (TACL) in addition to relatively standard COBOL and C compilers.
The main issue with this generation of Tandem computers was that it wasn't Unix - they did introduce Unix systems, but IIRC they weren't nearly as popular. The Tandem corporation was bought by Compaq in 1997 - a year before Compaq bought Digital Equipment and 4 years before Compaq 'merged' with HP. (see the Compaq entry at Wikipedia for the timetable)
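For anyone who hasn't written checkpointed software, here is a minimal sketch of the idea in Python. It is not Guardian's actual mechanism or API; the file-based checkpoint, the function names, and the `fail_at` parameter (used only to simulate a failed CPU) are all assumptions made for illustration.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write cannot leave a half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def process(records, path, every=3, fail_at=None):
    """Sum the records, checkpointing after every `every` records.

    `fail_at` is a hypothetical knob that simulates a hardware failure
    just before the record at that index is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated CPU failure")
        state["total"] += records[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run dies partway through, calling `process` again with the same checkpoint path picks up from the last saved point. Work done after that checkpoint is simply redone, which is why the checkpoint holds both the position and the running total.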
Admin
Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.
Admin
TRWTF would've been if the redundant backup support person taking down the redundant CPU was made redundant after this.
Admin
...or no where clause at all.
Admin
Didn't TFA mention him already being the redundant-redundant-redundant person?
The guru had his #1 (redundant) and #2 (redundant's redundancy) people both unavailable, so it fell on the redundant-redundancy's redundant machine maintainer to assure redundancy of the redundant system was restored and redundantly available?
Ouch.
Admin
I've done a similar thing with RAID drives... Somehow the order of the drives wasn't what it was supposed to be or something... but when you pull one too many drives from a RAID array, it's not too fun.
It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little LED on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"
Admin
This reminds me of an outage we had. It was a large-scale trading system. The servers had redundant power supplies. The admin who installed them plugged them both into the same power distribution module (basically a fancy power strip) in the rack. The module failed. We lost all of the hosts plugged into it. That was around his third major blunder like that. We let him go.
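That kind of wiring mistake is easy to catch with a script if you keep an inventory of which power feed each PSU is plugged into. A rough sketch, assuming a hypothetical inventory format (host name mapped to a list of feed labels, one per PSU); the host and PDU names are invented:

```python
def hosts_without_power_redundancy(psu_feeds):
    """Return hosts whose PSUs do not span at least two distinct feeds.

    `psu_feeds` maps a host name to the list of power feeds its PSUs are
    plugged into. A host with a single PSU, or with every PSU on the same
    feed, loses power completely if that one feed fails.
    """
    return sorted(
        host for host, feeds in psu_feeds.items() if len(set(feeds)) < 2
    )


# Hypothetical inventory for illustration only.
inventory = {
    "trade01": ["pdu-a", "pdu-b"],  # properly split across two feeds
    "trade02": ["pdu-a", "pdu-a"],  # both PSUs on the same module
    "trade03": ["pdu-b"],           # only one PSU at all
}

print(hosts_without_power_redundancy(inventory))
```

This only checks the labels in the inventory. It says nothing about whether two differently named modules share the same upstream circuit.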
Admin
Clearly this WTF is about over-reliance on perceived quality (as in "of course the Titanic doesn't need lifeboats for everyone - it's unsinkable!"). However, I think this story also highlights an unwritten rule in the IT profession: no system, however painstakingly designed, can withstand the destructive force of an idiot who's found the power switch.
Admin
Pfft. If you want reliability go for a TPF system.
Admin
Sometimes consequences are more severe - cases of shutting down the wrong engine: http://aviation-safety.net/database/dblist.php?Event=ACEW
Admin
I'd like to see you design and implement the Universal Hardware Fault Indicator (UHFI) spec. I bet it will be easy!
Admin
Let's clean up the language. It should be "buttuaged."
Admin
Actually, they're labelled "0", "1" and "File not found"
Admin
A guy that's worked for the company a month goes out on his first repair job and takes down all the ATMs for an entire country. I probably would've fired him too; at the very least you'd think the bank would be more annoyed.
Admin
Actually, most new server hardware (from the big boys like HP and IBM) already has that... The newer HP servers have LEDs for failed power supplies, fans, hard drives, memory sticks, PCI bus, etc.
Admin
What's with the sudden Internet trend of making fun of the phrase "mission-critical"? Do people really not know what it means?
Admin
Tandem used to offer these great coffee mugs as swag... they had two handles, one on each side.
Admin
This makes me think of the parallel programming articles today. They tell us how we'll all need to re-learn programming, and think in parallel. This article reminds us that special multi-processor programming is nothing new.
Admin
I was hired as a software engineer, not a hardware guy. Only desperate circumstances forced the company to send me out on a hardware call.
Admin
"Tandem Computers were all the rage in the mid-1980s, ..."
And the mid-1990s, and the mid-2000s, too.
"Jumentum?" There are so many inappropriate jokes available for that captcha.
Admin
Something caused the motherboard to fail. It probably also damaged the power on the unit. Do you want to rely on a defective LED light on that motherboard to show what to remove?
I'd rather have an LED go off to indicate a bad motherboard. I'm not an EE, and can't be sure this is the best way either.
Admin
And just occasionally someone figures out how to parallelize something new. That's when the field genuinely advances. It doesn't happen that often though.
Admin
Back in the day, I lusted after Tandem systems mightily. We could have used them instead of our IBM Big Iron.
IMHO, TRWTF is the fact that we seem to have in most cases retrogressed from the kind of reliability that Tandem and some other similar vendors supplied to the largely sloppy and slipshod environments that we often see described in these pages.
Admin
Oh, I guess that was "Joe-menutum".
Never mind.
Admin
While I agree that it wasn't a "simple" mistake, I would hire this guy in a second. Why? Someone who has been through such an "oh dear God" moment usually double checks stuff the rest of their life. I'm a network admin and everyone in our IT dept has had a "dear God" moment. After that, you ALWAYS backup/RTFM/double check or whatever the situation calls for. That said, the first time is a learning experience...the second is an RGE (resume generating event).
Admin
The WTF here is that the two CPU and power units were behind the same door. Solution: two doors with a one-open-at-a-time mechanism.
Admin
This board is not a fan of reading comprehension, is it?
From the article
How many small consulting firms do you know that hire backups to backups as their sole job? Isn't it highly likely that he was hired to do something else (write software perhaps) and was the de facto backup to the backup?
Yeah, let's fire that software guy who was hired to write software, who botched a hardware issue that WE thrust upon him.
/I probably wouldn't fire the main guru if it was the first mistake of this kind
Admin
Seeing this story reminded me of a discussion I saw on the Risks List about aircraft safety systems.
A while ago there was a thread about the redundant safety systems that kept planes flying when various elements went kaput, by adapting to and circumventing faults.
But the main issue was that once these systems have run out of wiggle room, and if the support measures are always behind the scenes, then the next failure will most likely be catastrophic.
Thus your backup systems need to keep people very aware when protection measures have kicked in.
Admin
Oh... wait... that isn't what the article said, is it? Reading comprehension, indeed.