Admin
Is this not the story that spawned the saying "make something idiot-proof, and the world will build a better idiot"?
Admin
Wow. Maybe you should repeat 4th grade reading comprehension before hitting the submit button on the forum.
All I said was that I would guess the problem with having a proprietary OS is that it would be harder to find experts in it than in a "mainstream" OS. That is ALL I said.
I never said Unix was better, I never said that Windows was better and I never did any comparisons at all to Non-Stop or anything else. I also never said that Unix was 100% fault tolerant and I also never said that there was exactly one Unix OS that is almighty.
Since I'd rather not have to start listing everything I don't say in my posts, maybe you should just start reading what was actually posted rather than jumping to absolutely insane conclusions.
Admin
Looks like you got a bite! "Reel" him in, aardvark! :p
Admin
When I see all the fancy redundant systems with dual-everything, in memory patching, redundant self correcting memory with redundant buses, this quote always comes to mind:
"The more complicated they make the plumbing, the easier it is to stop up the drain."
CAPTCHA: vereor
Admin
Am I the only one who thinks this wasn't a fantastically designed system? Surely if you're going to have redundant power supplies, that should mean that either power supply is capable of powering the entire system? That's certainly the way our (telco carrier-class) box works. If you lose a power supply, that doesn't cut the power to half of the redundant components; it simply removes power supply redundancy. You lose the system only if another power supply fails. Here, it sounds like the loss of power supply A means that you lose the system if power supply B, CPU B, memory B, etc. fails.
Admin
Great parable!
The moral of the story is that the "system" is not just the hardware, or just the hardware and the software, but the hardware, the software, the maintenance procedures, the training, and above all, constant practice, practice, practice.
Firemen know this; they constantly practise for unlikely events.
(Note to pedants: this is English English, not US English.)
Admin
Wait, so if the login scripts had never been updated correctly - on a machine that was active - then what about the machine that he was about to plug in? I certainly hope that 1 would stay online until 0 is ready ...
Admin
I work in the investment industry and I can assure you that Tandem is still "all the rage". These things power the financial markets; they're at the heart of every major stock and commodity exchange in the world.
I'm sure there are a few Microsoft customers happy to read this story though - "see? good thing we have to reboot every week - otherwise we'd never get to test our startup scripts"
Admin
However, there are lots of OSes out there that actually perform better than the Unix varieties, but sadly they have fallen into disuse for anyone not in the Big Iron field. And those who are into it are already showing their age; the youngest OS/390 "expert" I know is 45 years old. Besides him, there is a 43-year-old guy who has actually seen a Non-Stop Tandem; he never learned how to use it.
My other gripe would be the x86 trend of the last 10-15 years. Whatever happened to MIPS? Alpha? XMP? The only RISC vestiges left are ARM, SPARC and the Power processors, and none of these remain on the desktop/workstation side. Even Sun is selling AMD "workstations"...
Admin
This reminds me of a fun trip to a data center some number of years ago:
The company I worked for was looking at various data centers for our purposes, and I was invited to check one out with my boss. Having never seen one, I was more than happy to tag along.
So we get there, and they show us their fancy servers along with redundancy after redundancy: multiple backups storing multiple backups of the data, backup power generators in case the power goes out, backup generators for those generators, etc. The guy doing the walkthrough was almost beside himself going down the list of redundancy they had implemented. In order to lose your data, the power had to go down, all the redundant generators had to malfunction, all the backup servers would have to go down, etc. Really impressive!
That is, until one of the guys in the group points to a wall and goes: "Hey, what's that bright red switch?"
This shuts the salesguy up real quick and he goes: "Well, I was really hoping you wouldn't notice that. Per fire regulations, in a room this big we need a fire-alarm switch that's highly visible and very easy to operate."
Guy : "So, what happens if you pull the switch?"
Salesguy : "It, uh, turns on the fire extinguishers for the entire building."
Guy : "Spraying all the servers, and their backups, with water?"
Salesguy : "Uhm, yeah. That's an ongoing battle with the fire dept right now. We're trying to get that out of here. So uh.. please don't touch that switch."
They didn't get the deal, as far as I know..
Admin
Still using Token Ring?
I used to design Token Ring stuff. It was fun at first, until we had to test it. 16 Megabit TR over twisted pair was the Devil's network. Only on good days, when the planets were in alignment, and the moon was full, could you get the maximum specified node count to work error free. This was, of course, in the days of hubs and shared media. No switches yet.
And FCC testing? Yeah. Token Ring had very strict rise time specs on the waveform. It needed those sharp risetimes to reduce the node-to-node timing jitter. Every node regenerated the waveform, which had to look pretty much the same after it made it all the way around the ring, as it did when it started. Needless to say, those sharp edges and synchronized nodes made for interesting times at the old emissions test site.
Token Ring. Deterministic, yes, but three times as expensive as twisted pair Ethernet and a right royal PITA. You would have thought that 100 Mbit Ethernet would have been the stake in the heart of 16 Mbit Token Ring. You would be mistaken. I understand Token Ring is still lurking in the basement of banks and Wall Street companies. Good luck to them.
Admin
Tandem aren't the only sellers of this kind of kit. I worked in the telco industry, where Stratus sells a very similar configuration. In fact, each CPU is quadrupled: there are two pairs of CPUs, and the CPUs in each pair cross-check each other. If there's any disagreement, the pair shuts itself down and leaves the other pair to continue the job. So every instruction is executed four times in parallel.
In theory, this works. In fact I performed a memory upgrade once without shutting down the server. It was a test box, so it wasn't necessary - but I wanted to see if it could be done.
But in practice, 99% of our system failures were software related which of course affected all CPUs identically. I only ever saw one genuine hardware failure. In that one hardware failure case, a CPU board overheated and died. Something went funny in the bus between them at the time of changeover, and the second board fried itself in sympathy.
In theory, theory and practice are the same thing. In practice, they're not.
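If it helps to picture the pair-and-spare arrangement, here's a toy sketch of that voting logic. It's purely illustrative - the function and names are made up, not how Stratus actually implements it - but it shows why a pair that disagrees with itself takes itself out of service rather than bringing the whole system down.

```python
# Toy model of pair-and-spare: two lockstep pairs execute the same
# instruction; a pair that disagrees with itself goes offline, and the
# system only halts when no pair is left. Illustrative only.

def run_instruction(pair_outputs):
    """pair_outputs: list of (result_a, result_b) tuples, one per CPU pair."""
    surviving = []
    for pair_id, (a, b) in enumerate(pair_outputs):
        if a == b:
            surviving.append((pair_id, a))   # the pair agrees, its result is trusted
        else:
            print(f"pair {pair_id} disagrees internally; taking itself offline")
    if not surviving:
        raise RuntimeError("all CPU pairs offline: system halt")
    return surviving[0][1]

print(run_instruction([(42, 42), (42, 42)]))  # normal operation -> 42
print(run_instruction([(42, 41), (42, 42)]))  # pair 0 faults, pair 1 carries on -> 42
```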
Admin
Great story.
place the money (unmarked bills) in the bag and place it in the boot of the black Chevrolet if you ever want to see your son aga
Addendum (2008-07-23 23:42): **if you ever want to see your son aga --
Sorry, I too was writing another email while replying to this.. must have tabbed to the wrong window at some point.
Admin
The Tandem that we had came complete with D-cell batteries for redundancy to the redundant power supplies. Neat little feature; the system would shut down parts of the system that were less mission critical as the batteries were consumed. Did this bank not buy that feature, or forget to replace them?
Admin
Worst. Article. Ever.
What the heck is happening to the quality of articles on this site? They seem to now be either totally unbelievable or a complete non-story.
Dumbass new employee can't tell the difference between two marked power supply units, despite the fact that it was a mission-critical action and he could have checked and re-checked what he was gonna do before he flipped the switch - and the dumbass still flipped the wrong one! Add to that, he then posts that he'd only been at the company a month!
I award this article the coveted 5 nines on the Epic Fail Scale, making it a 99.999% failure.
I mean, seriously, I surf the internet at work when I'm not supposed to, just to read this garbage?!
Screw MFD. Screw these awful articles. WTF happened?
Admin
Heh, this sounds just like one bank in Finland. Sampo Bank has been having problems with almost everything related to customers.
First their online bank went down. Then ATM cards stopped working for some customers. After that, a few major companies' salary payments were delayed.
Admin
But the first PSU didn't fail at all. So why wasn't the first PSU maintaining power while the second was off? That's how every redundant power supply I've ever seen works. Both PSUs should supply power to the entire system.
Admin
The real WTF there is that the redundant backup servers were in the same building. What would have happened in the event of fire?
Admin
Which, on the other hand, could mean that one PSU that goes haywire can fry the entire system.
In any case, the instructions obviously required the operator to cut power to the socket with the defective CPU, and by accident, he cut power to the other socket. Power failover makes no sense in this respect, so the claim that an entire PSU had to be switched off may just be a minor inaccuracy of the report.
Admin
I've never heard of a minicomputer that was able to run off D-cell batteries! But even if there were some kind of battery backup, it would still run through the switch. The one I turned off. So it would have made no difference.
Admin
The system was designed to be resistant to a single point of failure. Not two points of failure. So one power supply could fail and bring down one half of the machine, and the other half would just keep on trucking. There was no failover between the power supplies.
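To put some rough numbers on the disagreement above, here's a back-of-the-envelope comparison of the two schemes, with each side modelled as a PSU plus a CPU and a made-up, independent failure probability. It's only a sketch to illustrate the argument, not a claim about the real Tandem hardware.

```python
# Back-of-the-envelope failure-probability comparison of the two redundancy
# schemes discussed above. All numbers are made up; illustrative only.
p = 0.01  # assumed probability that any single component (PSU or CPU) fails

# Shared redundancy: either PSU can power everything, so the system fails
# only if both PSUs fail or both CPUs fail.
p_shared = p * p + p * p  # ignoring the tiny both-at-once overlap term

# Split halves: a side is lost if its PSU *or* its CPU fails; the system
# fails only when both sides are lost.
p_side = 1 - (1 - p) ** 2
p_split = p_side * p_side

print(f"shared redundancy: ~{p_shared:.6f}")  # ~0.000200
print(f"split into halves: ~{p_split:.6f}")   # ~0.000396
```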
Admin
In fairness, it should be pointed out that some employers want that. And in some environments/departments/roles you might even be able to argue that they're right to do so.
Admin
Actually, this is an interesting read on just this thesis and no sarcasm at all: http://www.cs.berkeley.edu/~pattrsn/talks/ROC_ASPLOS_draft4.pdf
Admin
This one is from the World Series earthquake:
Date: 3 Nov 89 15:26:52 GMT
From: [email protected] (Douglas Humphrey)
To: misc.security
Subject: Re: Earthquake

A Tandem Computers VLX system, a fault-tolerant transaction processing system, fell over flat on its back (this is a big mainframe, maybe 6 cabs of 6 feet tall and 28 inches or so wide, and weighs a LOT).
It was, of course, still operating just fine flat on its back. The disks were still upright, due to their being shorter and having lower centers of gravity. From what I have been told, it was uprighted by Tandem CEs and never missed a beat. They had a UPS for power, obviously...
Doug
Admin
"The only difference between a thing that can go wrong, and a thing that cannot possibly go wrong, is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or fix."
Admin
I was working on a 4160V motor starter a few feet down the lineup from the main incoming breaker for a natural gas transfer station. We had a tool cart set up conveniently near our work -- right in front of the main breaker. I bent down to get a tool from the bottom shelf and as I stood up and turned around simultaneously I felt the pistol-grip operating handle for the main slide into my pants pocket and begin to turn ... then there was a loud bang and it got dark and quiet. Dark and quiet is a bad combination in an industrial facility.
I didn't get fired either. Nobody was laughing at the time but everybody got over it. The main turbine didn't restart and required an emergency service call from the vendor but that wasn't fundamentally my fault. Counting that, the incident probably cost our customer about $50,000.
These things happen. People in business understand that. You don't fire a guy who makes an honest mistake unless you have to. If the customer had not gotten over it my boss would probably have had to but it all blew over. I probably should have been disciplined for allowing the work area to be so disorganized but I learned my lesson.
Admin
The company that provided this system should be liable for any losses incurred due to this. This is akin to labeling the "OFF" switch with an "ON" label and blaming the user for not looking in the manual.
Hazards, whether safety-related or operational, should be reduced by design, and only when that's not further possible, should one use signs/verbiage.
Admin
When I worked in the financial industry, we had similar problems with Solaris servers: they were reliable enough that no one wanted to take them down for scheduled maintenance to verify that all the changes would boot correctly.
That's actually the way Windows sneaked into our data center. At the time it was NT4 so we had our servers on a scheduled weekly reboot, plus that's when we made the switch to distributed redundancy. Even Windows would stay up pretty well if it had been rebooted within a week and if every layer could fail over to at least one copy in a different data center.
Admin
The system should have protected against simple human error like this. In case of a failed CPU, switching off the wrong PSU should have been prevented or at least delayed somehow.
This is a case of design oriented only towards machines. Sure the system by itself was quite reliable, but when one adds the human factor needed for maintenance, it was not very reliable at all. Tandem machines were designed as if the maintenance was done by a robot controlled by another Tandem machine. They should have designed for human maintainers, you know...
This is a BIG WTF, compounded by the fact that TDWTF readers don't see it: it's a blatant, often-committed mistake. I mean, come on, even in DOS you would type del . and it would ask you if you are sure...
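A guard of the kind being asked for here doesn't have to be fancy. Here's a minimal sketch (hypothetical module names, invented health-check) of a maintenance interlock that refuses to cut power to a module still reporting healthy and asks for confirmation otherwise - roughly the software equivalent of DOS asking "Are you sure?".

```python
# Sketch of a maintenance interlock: refuse to power off a healthy module,
# and confirm before powering off a faulted one. Names are hypothetical.

def health_status():
    """Pretend query of the machine's self-diagnostics."""
    return {"PSU-A": "FAULTED", "PSU-B": "OK"}

def power_off(module):
    status = health_status()
    if status.get(module) == "OK":
        faulted = [m for m, s in status.items() if s != "OK"]
        raise RuntimeError(f"refusing to power off healthy module {module}; "
                           f"faulted modules are {faulted}")
    answer = input(f"{module} is marked {status[module]}. Really power it off? [y/N] ")
    if answer.strip().lower() == "y":
        print(f"powering off {module}")
    else:
        print("aborted")

power_off("PSU-B")  # raises: PSU-B is the healthy one
```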
Admin
Dude, the system was designed with total disregard for human factors. Get over it. People make mistakes. Tandem Corporation assumed they don't. Chris should have double-checked, but it's not like someone else wouldn't have committed a similar snafu later.
Firing over something like this is totally dumb. People usually learn from their mistakes -- employees with a bunch of (varying) WTFs behind them, if they are otherwise competent, are much more valuable than "green" hires.
Admin
I was so impressed I considered pulling out a few extra parts just to see if they would light up too, but managed to restrain myself before making a complete mess of things.
Admin
A friend of mine was putzing around with the Stratuses in the data center when the earthquake hit. Apparently, the floor rippled towards him. "Dude," he said (he has a ponytail), "that's the only time I've ever been able to surf in a data center."
Every single machine went down hard, except for the five Stratuses, which didn't so much as blink an LED. They just kept on running for the next six or seven hours while everything around them was being patched back together.
Of course, since they were FEPs and now had nothing operational behind them, that wasn't much damn use to anybody. Sometimes Five Nines just doesn't meet the business requirements.
Admin
Yeah, it does suck to be stupid, doesn't it?
Admin
Typical. That's probably why Tandem doesn't exist any longer. People make the mistakes and the computer gets blamed. Sure, blame the computer. It can't defend itself.
Tandem Employee #1570 (1980-2000)
Admin
He shouldn't have been fired. People will occasionally flip the wrong switch. They're only human. It's such a predictable mistake that procedures should be in place to handle this. They weren't. Not his fault.
That, and of course they now have a guy who is going to be extremely careful about every switch he might flip.
Admin
I worked on a Stratus system for a while - really quite nice boxes and Tandem competitors. They also went in for the duplication of components etc, hot replacement etc. Was amused to see the Stratus support engineer get egg on his face when he tried to demo how you could remove any card and it would carry on. He pulled out the one card that wasn't duplicated, machine died! Luckily for him, it wasn't in production at the time.
Meanwhile, this sort of thing (and the customs and excise story below) is all part of what configuration management (CMDB), as part of ITIL or service management, is about - knowing which box is running which application/service.
Admin
Tandems aren't in use by my company (largest software vendor for financial exchanges in the world) :)
Admin
Even with Chris' mistake, they'd have been within that ballpark - if the techies at the bank hadn't boobytrapped the computer...
Really, Chris' mistake was just an 'oops' - nothing to get fired over. A country's ATM machines going offline for 5 minutes once in 3 years is not that big a deal. Really. The fact that the system just couldn't reboot was the big issue. If anyone was going to get the blame, it should have been the people who did that (or the people who told those people not to do a test reboot of the server at 3am on a quiet day)
Also, Tandem should have put physical interlocks in place to stop switches being turned off or modules being removed if they were the vital ones. (With a key operated 'total shutdown' switch to stop the computer from taking over the world with no way to kill it).
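For what it's worth, the "within that ballpark" remark checks out: five nines of availability only buys a little over five minutes of downtime per year, so one five-minute outage in three years is still comfortably inside the budget. A quick worked calculation:

```python
# Downtime budget implied by "N nines" of availability, and where a single
# 5-minute outage in 3 years lands against it.
minutes_per_year = 365.25 * 24 * 60

for nines in (3, 4, 5):
    availability = 1 - 10 ** -nines
    budget = (1 - availability) * minutes_per_year
    print(f"{nines} nines ({availability:.5%}): {budget:.2f} minutes of downtime per year")

# One 5-minute outage over 3 years:
actual = 1 - 5 / (3 * minutes_per_year)
print(f"5 minutes down in 3 years: {actual:.6%} availability")
```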
Admin
Tandem used to advertise that you could shoot a bullet into their computers and they would continue to operate, and prove it by doing so.
Some stock exchanges use their gear; they were bought out by HP a few years ago.
Admin
I've done some terrible, terrible things to Stratuses in my time. I've put them into an endless rebooting cycle (as I mentioned in a post about a year ago) because I wanted to go down the pub; I've used them to bring down the entire Compuserve network, almost permanently (on the orders of my PHB. Sadly, humane killing is not yet legal in California); I've used a Z80 line card (one of the few things that isn't duplexed) to spray random instructions into the kernel and wipe it out; I've even watched a colleague use the Motorola level 7 interrupt button without asking him why.
For about two months, we kept our credit-card processing system up to the latest government regs with a ten-line script I ran through the multi-process debugger.
But I have never heard of a shop that allowed this level of insane patching on a live system. Not Stratus. Not Tandem. It's pretty much antithetical to the entire ethos.
Chris is very definitely not to blame. At the very least, he should be complimented for giving the company a water-tight reason to sack the entire support staff.
BTW ... If that story above about a Stratus engineer pulling out a non-duplexed board is true: well, they've gone downhill a very long way since my days.
Admin
Yep, these things are fun. 20 years ago I watched in awe as a company in the same building that I worked in moved their tandem from one floor to another without any user ever noticing anything. They'd take one cabinet out of the fabric, wheel it downstairs, clean it, hook it back in, run some checks, bring it up, wait 15 minutes, then go get the next one. Took three days.....
Now, I have one where I work, and I somehow ended up in charge of it. Since we're a development shop, there's no production workload on it, so when I have a problem I just reboot ("cold load") it. If that doesn't fix it, I just call support (they have the best support, hands down, of any company I have EVER had to call for help). I have a lot of fun telling them, "hey, yeah, my Fribozz was acting up, so I cold loaded it and that didn't fix it, what next?". They used to damn near have a heart attack at the idea that I cold loaded it, but they're kind of getting used to me now...
My CE told me a story one day about a customer in California that called support after an earthquack and said that his machine had gone down and he needed help getting it back up. They started to walk him through the startup procedure, and he said "no, no, you don't understand. It fell through the floor and I need help getting it back up on the rest of the floor. It's still running."
Factoid: It took me months of usenet searching and asking questions to find the right way to turn one off. Turns out it was in the manuals, but in a place that you'd never find it....
Admin
I also had an issue once with RAID drives. I was sent to China to update a system, and the customer insisted that I do a backup before updating. However, what nobody told me before going there was that Norton Ghost didn't work well with RAID drives when doing a backup from Windows (at least that old version). No, you had to boot from a floppy to properly ghost the drives, so after some fiddling by me, the server wouldn't boot any longer. Then it went from bad to worse: when I tried to fix the problem by pulling out one of the RAID drives, the other drive got corrupted as well. Of course it turned out the customer had not done any backups, since nobody had informed him it was his responsibility. The whole plant stood still for two days. Imagine how extremely popular I was.
Turned out the update was so small that it would have been done in 5 minutes without the need for a backup, but the customer insisted on the backup. Doh!
Admin
This has happened to us in less dramatic ways with regular Linux/Intel servers when the uptimes were 'too long'. We've had 400+ day uptimes, during which a number of new services were put on and configured on some systems. Along comes a hardware failure, or an electrical blackout compounded by some UPS problem, and the machine reboots and does not come up quite the way it was supposed to. Someone wrote an init script for some service or another and never tested rebooting on the actual system (because it was in production), with its dependencies and other services.
Our work is mostly not that time-critical; we can afford an hour or two of downtime for debugging the booting. Problems like this just show that uptime should not be an end in itself. Not that it ever was for us, we just didn't bother rebooting. Rebooting every now and then is good for you, if not for getting in a new kernel, then at least for checking that rebooting works.
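One cheap way to catch that kind of drift without waiting 400 days for a surprise is to regularly compare what is running now with what is actually configured to start at boot. Here's a minimal sketch of that check, assuming a modern systemd-based host (the story above predates systemd, so take it as an analogue rather than what we ran):

```python
# Flag services that are running now but not enabled at boot, i.e. the
# things that will silently vanish on the next reboot. Assumes systemd;
# adapt the commands for other init systems.
import subprocess

def units(cmd, state):
    """Return the set of service names from a systemctl listing."""
    out = subprocess.run(
        ["systemctl", cmd, "--type=service", f"--state={state}",
         "--no-legend", "--plain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.split()[0] for line in out.splitlines() if line.strip()}

running = units("list-units", "running")        # running right now
enabled = units("list-unit-files", "enabled")   # will start after a reboot

for svc in sorted(running - enabled):
    print(f"WARNING: {svc} is running but not enabled to start at boot")
```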
Admin
That was the mother of all quacks.
Admin
From reading all these comments, and speaking here as a computer expert, I have concluded that these Tandems are absolutely A1 bad-butt, and I want one!