• (cs)

    Is this not the story that spawned the saying "make something idiot-proof, the world will build a better idiot"?

  • SomeCoder (unregistered) in reply to real_aardvark
    real_aardvark:
    SomeCoder:
    Saaid:
    I'll bite, why was it an 'issue' that they weren't Unix?

    Well, I don't know for sure, but I'd guess that an "issue" would be that they used a proprietary OS. If you advertise for an XYZ OS guru vs a Unix guru, you're going to get a lot more qualified people for the Unix one.

    You'd think so, wouldn't you?

    Well, you would if you'd never tried it and didn't bother putting the slightest effort into thinking before you comment.

    I'd imagine that XYZ OS gurus are thin on the ground, but if you advertise for a Non-Stop guru, or a VOS guru, or a TPF guru, the chances are that you'll get a sizeable collection of high-quality applicants. (How this is meant to help you when some wet-behind-the-ears guy comes in and pulls boards without thinking is unclear. In my old VOS environment, the culprit was at least a salesman. This was good, because we could extort large chunks of his expense account in drunken revenge.)

    Of course, it'll cost you. OS Gurus are different from Enterprise Architects.

    If you advertise for a Unix OS guru, you're in the klartz. First of all, you have to sieve through thousands of flavours of *nix. Then you have to sift through thousands of possible combinations of requirements on your own flavour of *nix.

    Then you have to face up to the fact that 99% of people who claim to be Unix OS gurus are in fact bare-faced lying morons. And it'll still cost you.

    And you won't even get a fault-tolerant system, because you really can't build one of those with standard Unix -- otherwise, the market being the market, somebody would have done so. Double panics all round!

    Why people persist in thinking that Unix is anything other than a clapped-out old 1970s OS in dire need of a bullet through the head (both feet already having been self-sacrificed) is beyond me. I use it, but I don't have to admire it.

    Wow. Maybe you should repeat 4th grade reading comprehension before hitting the submit button on the forum.

    All I said was that I would guess the problem with having a proprietary OS would be that it would be harder to find experts in it than the "mainstream" OS. That is ALL I said.

    I never said Unix was better, I never said that Windows was better and I never did any comparisons at all to Non-Stop or anything else. I also never said that Unix was 100% fault tolerant and I also never said that there was exactly one Unix OS that is almighty.

    Since I'd rather not have to start listing everything I don't say in my posts, maybe you should just start reading what was actually posted rather than jumping to absolutely insane conclusions.

  • haero (unregistered) in reply to SomeCoder

    Looks like you got a bite! "real" him in, aardvark! :p

  • Nonymous (unregistered)

    When I see all the fancy redundant systems with dual-everything, in-memory patching, and redundant self-correcting memory with redundant buses, this quote always comes to mind:

    "The more complicated they make the plumbing, the easier it is to stop up the drain."

    CAPTCHA: vereor

  • (cs) in reply to real_aardvark
    real_aardvark:
    By the way -- did anybody comment that 1 is not a prime number yet?
    Got an ETA for its appointment?
  • (cs)

    Am I the only one who thinks this wasn't a fantastically-designed system? Surely if you're going to have redundant power supplies, that should mean that either power supply is capable of powering the entire system? That's certainly the way our (telco carrier-class) box works. If you lose a power supply, that doesn't cut the power to half of the redundant components; it simply removes power supply redundancy. You lose the system only if another power supply fails. Here, it sounds like the loss of power supply A means that you lose the system if power supply B, CPU B, memory B etc fails.

  • Greg (unregistered)

    Great parable!

    The moral of the story is that the "system" is not just the hardware, or just the hardware and the software, but the hardware, the software, the maintenance procedures, the training, and above all, constant practice, practice, practice.

    Firemen know this; they constantly practise for unlikely events.

    (Note to pedants: this is English English, not US English.)

  • (cs)

    Wait, so if the login scripts had never been updated correctly - on a machine that was active - then what about the machine that he was about to plug in? I certainly hope that 1 would stay online until 0 is ready ...

  • nisl (unregistered)

    I work in the investment industry and I can assure you that Tandem is still "all the rage". These things power the financial markets; they're at the heart of every major stock and commodity exchange in the world.

    I'm sure there are a few Microsoft customers happy to read this story though - "see? good thing we have to reboot every week - otherwise we'd never get to test our startup scripts"

  • Sutherlands (unregistered) in reply to nisl
    nisl:
    I'm sure there are a few Microsoft customers happy to read this story though - "see? good thing we have to reboot every week - otherwise we'd never get to test our startup scripts"
    Someone needs to teach you the meaning of the word "sure"
  • (cs) in reply to real_aardvark
    real_aardvark:
    Why people persist in thinking that Unix is anything other than a clapped-out old 1970s OS in dire need of a bullet through the head (both feet already having been self-sacrificed) is beyond me. I use it, but I don't have to admire it.
    At least it's not Windows. NT's basically an "I wish I was UNIX" copycat with the Win9x look and feel, less buggy, a bit more secure ... but it fails in stuff where UNIX doesn't.

    However, there are lots of OSes out there that actually perform better than the Unix varieties, but sadly they have fallen into disuse for anyone not in the Big Iron field. And those who are in it are already showing their age; the youngest OS/390 "expert" I know is 45 years old. Besides him, there is a 43-year-old guy who has actually seen a Non-Stop Tandem; he never learned how to use it.

    My other gripe would be the x86 trend of the last 10-15 years. Whatever happened to MIPS? Alpha? XMP? The only RISC vestiges left are ARM, SPARC and the Power processors, and none of these remain on the desktop/workstation side. Even Sun is selling AMD "workstations"...

  • (cs)

    This reminds me of a fun trip to a data center some number of years ago:

    The company I worked for was looking at various data centers for our purposes, and I was invited to check one out with my boss. Having never seen one, I was more than happy to tag along.

    So, getting there, they showed us their fancy servers along with redundancy after redundancy: multiple backups storing multiple backups of the data, backup power generators in case the power goes out, backup generators for those generators, etc etc. The guy doing the walkthrough was almost beside himself going down the list of redundancy they had implemented. In order to lose your data, the power had to go down, all the redundant generators had to malfunction, all the backup servers had to go down, etc. Really impressive!

    That is, until one of the guys in the group points to a wall and goes: "Hey, what's that bright red switch?"

    This shuts the salesguy up real quick and he goes: "Well, I was really hoping you wouldn't notice that. Per fire regulations, in a room this big we need a fire-alarm switch that's highly visible and very easy to operate."

    Guy : "So, what happens if you pull the switch?"

    Salesguy : "It, uh, turns on the fire extinguishers for the entire building."

    Guy : "Spraying all the servers, and their backups, with water?"

    Salesguy : "Uhm, yeah. That's an ongoing battle with the fire dept right now. We're trying to get that out of here. So uh.. please don't touch that switch."

    They didn't get the deal, as far as I know..

  • Peter (unregistered) in reply to nisl

    Still using Token Ring?

    I used to design Token Ring stuff. It was fun at first, until we had to test it. 16 Megabit TR over twisted pair was the Devil's network. Only on good days, when the planets were in alignment, and the moon was full, could you get the maximum specified node count to work error free. This was, of course, in the days of hubs and shared media. No switches yet.

    And FCC testing? Yeah. Token Ring had very strict rise time specs on the waveform. It needed those sharp rise times to reduce the node-to-node timing jitter. Every node regenerated the waveform, which had to look pretty much the same after it made it all the way around the ring as it did when it started. Needless to say, those sharp edges and synchronized nodes made for interesting times at the old emissions test site.

    Token Ring. Deterministic, yes, but three times as expensive as twisted pair Ethernet and a right royal PITA. You would have thought that 100 Mbit Ethernet would have been the stake in the heart of 16 Mbit Token Ring. You would be mistaken. I understand Token Ring is still lurking in the basement of banks and Wall Street companies. Good luck to them.

  • Jonathan (unregistered) in reply to nisl

    Tandem aren't the only sellers of this kind of kit. I worked in the Telco industry, where Stratus sells a very similar configuration. In fact, each CPU is quadrupled. There are two pairs of CPUs, and the two CPUs in each pair cross-check each other. If there's any disagreement, the pair shuts down and leaves the other pair to continue the job. So every instruction is executed four times in parallel.

    In theory, this works. In fact I performed a memory upgrade once without shutting down the server. It was a test box, so it wasn't necessary - but I wanted to see if it could be done.

    But in practice, 99% of our system failures were software related, which of course affected all CPUs identically. I only ever saw one genuine hardware failure. In that one case, a CPU board overheated and died. Something went funny in the bus between them at the time of changeover, and the second board fried itself in sympathy.

    In theory, theory and practice are the same thing. In practice, they're not.

  • Tim P (unregistered) in reply to Grovesy

    Great story.

    I want to file it under ‘urban legend’, but I always quite liked the story.

    and it's ability to shove things in

    Addendum (2008-07-22 10:47): **and it's ability to shove things in --

    Ooops I was writing an email while writing this.. must have got the wrong window at some point..

    place the money (unmarked bills) in the bag and place it in the boot of the black Chevrolet if you ever want to see your son aga

    Addendum (2008-07-23 23:42): **if you ever want to see your son aga --

    Sorry, I too was writing another email while replying to this.. must have tabbed to the wrong window at some point.

  • also a coward (unregistered) in reply to anoncow

    The Tandem that we had came complete with D-cell batteries for redundancy to the redundant power supplies. Neat little feature; the system would shut down parts of the system that were less mission critical as the batteries were consumed. Did this bank not buy that feature, or forget to replace them?

  • frustrati (unregistered) in reply to ThePants999
    ThePants999:
    Am I the only one who thinks this wasn't a fantastically-designed system? Surely if you're going to have redundant power supplies, that should mean that either power supply is capable of powering the entire system? That's certainly the way our (telco carrier-class) box works. If you lose a power supply, that doesn't cut the power to half of the redundant components; it simply removes power supply redundancy. You lose the system only if another power supply fails. Here, it sounds like the loss of power supply A means that you lose the system if power supply B, CPU B, memory B etc fails.
    Yes, you are the only one. Mainly because you didn't actually read the article. Hint: the second power supply did not fail because of the way the system was designed (although it might have happened because the PSUs were not properly identifiable to the maintenance personnel).
  • frustrati (unregistered) in reply to Flash
    Flash:
    Jason:
    Tandem used to offer these great coffee mugs as swag... they had two handles, one on each side.
    And the souvenir pens had writing points at both ends!
    Yeah, and the souvenir notepads had double-sided paper!
  • Endo808 (unregistered)

    Worst. Article. Ever.

    What the heck is happening to the quality of articles on this site? They seem to now be either totally unbelievable or a complete non-story.

    Dumbass new employee can't tell the difference between two marked power supply units, despite the fact it was a mission-critical action and he could have checked and re-checked what he was gonna do before he flipped the switch, and the dumbass still flipped the wrong one! Add to that, he then posts that he'd only been at the company a month!

    I award this article the coveted 5 nines on the Epic Fail Scale, making it a 99.999% failure.

    I mean, seriously: I surf the internet at work when I'm not supposed to, just to read this garbage?!

    Screw MFD. Screw these awful articles. WTF happened?

  • Kake (unregistered)

    Heh, this sounds just like one bank in Finland. Sampo Bank has been having problems with almost everything related to customers.

    First their online bank went down. Then ATM cards stopped working for some customers. After that, a few major companies' salary payments were delayed.

  • TimB (unregistered) in reply to frustrati
    frustrati:
    Yes, you are the only one. Mainly because you didn't actually read the article. Hint: the second power supply did not fail because of the way the system was designed (although it might have happened because the PSUs were not properly identifiable to the maintenance personnel).

    But the first PSU didn't fail at all. So why wasn't the first PSU maintaining power while the second was off? That's how every redundant power supply I've ever seen works. Both PSUs should supply power to the entire system.

  • jimicus (unregistered) in reply to Grafalgar
    Salesguy : "It, uh, turns on the fire extinguishers for the entire building."

    Guy : "Spraying all the servers, and their backups, with water?"

    Salesguy : "Uhm, yeah. That's an ongoing battle with the fire dept right now. We're trying to get that out of here. So uh.. please don't touch that switch."

    The real WTF there is that the redundant backup servers were in the same building. What would have happened in the event of fire?

  • AdT (unregistered) in reply to TimB
    TimB:
    But the *first* PSU didn't fail at all. So why wasn't the first PSU maintaining power while the second was off? That's how every redundant power supply I've ever seen works. Both PSUs should supply power to the entire system.

    Which, on the other hand, could mean that one PSU that goes haywire can fry the entire system.

    In any case, the instructions obviously required the operator to cut power to the socket with the defective CPU, and by accident, he cut power to the other socket. Power failover makes no sense in this respect, so the claim that an entire PSU had to be switched off may just be a minor inaccuracy of the report.

  • anoncow (unregistered) in reply to also a coward

    I've never heard of a minicomputer that was able to run off D-cell batteries! But even if there were some kind of battery backup, it would still run through the switch. The one I turned off. So it would have made no difference.

  • anoncow (unregistered) in reply to AdT
    AdT:
    TimB:
    But the *first* PSU didn't fail at all. So why wasn't the first PSU maintaining power while the second was off? That's how every redundant power supply I've ever seen works. Both PSUs should supply power to the entire system.

    Which, on the other hand, could mean that one PSU that goes haywire can fry the entire system.

    In any case, the instructions obviously required the operator to cut power to the socket with the defective CPU, and by accident, he cut power to the other socket. Power failover makes no sense in this respect, so the claim that an entire PSU had to be switched off may just be a minor inaccuracy of the report.

    The system was designed to be resistant to a single point of failure. Not two points of failure. So one power supply could fail and bring down one half of the machine, and the other half would just keep on trucking. There was no failover between the power supplies.

  • GregW (unregistered) in reply to Jay
    It seems to me that organizations that have a philosophy of "one mistake and you're fired" are likely to quickly end up with employees who exercise absolutely no initiative.

    In fairness, it should be pointed out that some employers want that. And in some environments/departments/roles you might even be able to argue that they're right to do so.

  • illtiz (unregistered) in reply to nisl

    Actually, this is an interesting read on just this thesis, and no sarcasm at all: http://www.cs.berkeley.edu/~pattrsn/talks/ROC_ASPLOS_draft4.pdf

  • Rhialto (unregistered) in reply to akatherder
    akatherder:
    Some people are on a mission to criticize any word that they simply disagree with.
    That includes you, apparently.
    akatherder:
    Specifically words that are newer to the lexicon. "Mission" can mean a NASA mission or a big secret government project, and that's it.
    That is not actually true in various ways. "Mission" is not exactly a new word. My dictionary (Collins Concise English Dictionary) lists 10 numbered meanings for "mission" (which I'm not going to copy all). Number 1 reads "1. a specific task or duty assigned to a person or group of people". Other meanings include "6.a. a building in which missionary work is performed" and "7. the dispatch of aircraft or spacecraft to achieve a particular task".
  • Steve Nuchia (unregistered) in reply to nisl

    This one is from the World Series earthquake:

    Date: 3 Nov 89 15:26:52 GMT
    From: [email protected] (Douglas Humphrey)
    To: misc.security
    Subject: Re: Earthquake

    A Tandem Computers VLX system, a fault-tolerant transaction processing system, fell over flat on its back (this is a big mainframe, maybe 6 cabs of 6 feet tall and 28 inches or so wide, and weighs a LOT).

    It was, of course, still operating just fine flat on its back. The disks were still upright, due to their being shorter and having lower centers of gravity. From what I have been told, it was uprighted by Tandem CEs and never missed a beat. They had a UPS for power, obviously...

    Doug

  • Edss (unregistered)

    "The only difference between a thing that can go wrong, and a thing that cannot possibly go wrong, is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or fix."

  • Steve Nuchia (unregistered) in reply to danixdefcon5

    I was working on a 4160V motor starter a few feet down the lineup from the main incoming breaker for a natural gas transfer station. We had a tool cart set up conveniently near our work -- right in front of the main breaker. I bent down to get a tool from the bottom shelf and as I stood up and turned around simultaneously I felt the pistol-grip operating handle for the main slide into my pants pocket and begin to turn ... then there was a loud bang and it got dark and quiet. Dark and quiet is a bad combination in an industrial facility.

    I didn't get fired either. Nobody was laughing at the time but everybody got over it. The main turbine didn't restart and required an emergency service call from the vendor but that wasn't fundamentally my fault. Counting that, the incident probably cost our customer about $50,000.

    These things happen. People in business understand that. You don't fire a guy who makes an honest mistake unless you have to. If the customer had not gotten over it my boss would probably have had to but it all blew over. I probably should have been disciplined for allowing the work area to be so disorganized but I learned my lesson.

  • Kuba (unregistered) in reply to Martin
    Martin:
    I went to install it, stood behind the server. There were two PSUs. One with a single green LED (Unit 1), another one with a green LED and a blinking Red LED (Unit 2).

    I removed Unit 2, and the server in front of me suddenly got very quiet. I reseated Unit 2, replaced Unit 1, and both were back up - with a lit green LED and a blinking Red LED.

    That taught me to always have a very, very close look at the maintenance manual. Not everything is intuitive.

    The company who provided this system should be liable for any losses incurred due to this. This is akin to labeling the "OFF" switch with an "ON" label and blaming the user for not looking in the manual.

    Hazards, whether safety-related or operational, should be reduced by design, and only when that's no longer possible should one fall back on signs and verbiage.

  • wgc (unregistered) in reply to nisl

    When I worked in the financial industry, we had similar problems with Solaris servers: they were reliable enough that no one wanted to take them down for scheduled maintenance to verify that all the changes would boot correctly.

    That's actually the way Windows sneaked into our data center. At the time it was NT4 so we had our servers on a scheduled weekly reboot, plus that's when we made the switch to distributed redundancy. Even Windows would stay up pretty well if it had been rebooted within a week and if every layer could fail over to at least one copy in a different data center.

  • Kuba (unregistered) in reply to AccessGuru
    AccessGuru:
    Alex:
    no one appreciated the irony that a system so painstakingly designed for uptime had become so downtime-prone

    Downtime-prone? It was up non-stop for three years, and the main reason it went down for 24 hours was human error? Sounds like the people were downtime-prone.

    The system should have protected against simple human error like this. In case of a failed CPU, switching off the wrong PSU should have been prevented or at least delayed somehow.

    This is a case of design oriented only towards machines. Sure the system by itself was quite reliable, but when one adds the human factor needed for maintenance, it was not very reliable at all. Tandem machines were designed as if the maintenance was done by a robot controlled by another Tandem machine. They should have designed for human maintainers, you know...

    This is a BIG WTF, compounded by the fact that TDWTF readers don't see it: it's a blatant, often-committed mistake. I mean, come on, even in DOS you would type del . and it would ask you if you are sure...
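
    As an aside, the kind of guard being argued for here is cheap to build in software. A minimal, purely illustrative sketch (the unit names and "fault map" are made up, not Tandem's actual interface) of a maintenance front-end that refuses to cut power to a healthy unit and makes the operator re-type the ID of the faulty one, DOS "Are you sure?" style:

        # Hypothetical sketch only: unit names and the fault map are illustrative.
        FAULTY_UNITS = {"PSU-1"}   # units that diagnostics have flagged as failed

        def power_off(unit_id: str) -> None:
            """Cut power to a unit, with an interlock and an explicit confirmation step."""
            if unit_id not in FAULTY_UNITS:
                # Interlock: this unit is healthy and carrying load, so refuse outright.
                raise RuntimeError(f"{unit_id} is in service; refusing to power it off.")
            # Confirmation: make the operator re-type the unit ID before anything happens.
            answer = input(f"{unit_id} is flagged as faulty. Type its ID again to confirm power-off: ")
            if answer.strip() != unit_id:
                print("IDs did not match; nothing was switched off.")
                return
            print(f"Powering off {unit_id}...")  # the real switching logic would be driven here

        if __name__ == "__main__":
            power_off("PSU-1")   # the flagged unit: prompts for confirmation first

    It wouldn't have made the login scripts boot-safe, but it would have stopped a sleepy operator from killing the half of the machine that was still healthy.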

  • Kuba (unregistered) in reply to FredSaw
    FredSaw:
    Pez:
    sewiv:
    Not quite clear on the WTF here. He flipped the wrong switch, the system went down. He kept his job? Is that the WTF?
    Jesus. You'd sack someone for a simple mistake anyone could've made? Must suck to work for you.
    I'd sack him, too. He was hired to maintain that computer. It wasn't "a simple mistake". It took an impressive amount of disregard for caution and double-checking to make that mistake. What would have been "simple" is getting it right.

    Dude, the system was designed with total disregard for human factors. Get over it. People make mistakes; the Tandem corporation assumed they don't. Chris should have double-checked, but it's not like someone else wouldn't have committed a similar snafu later.

    Firing over something like this is totally dumb. People usually learn from their mistakes -- employees with a bunch of (varying) WTFs behind them, if they are otherwise competent, are much more valuable than "green" hires.

  • D C Ross (unregistered) in reply to Crabs
    Crabs:
    The Dell Poweredge servers have a small LCD display that tells you exactly what part is broken if there is a hardware problem. It's pretty sweet. A light on the actual piece would be cool, though, but the extra wiring required for that would be a bit of a pain.
    The new bright-star-in-the-daytime servers have exactly that feature. I recently opened one up to replace a bad RAM module and was surprised to find a little LED lit up right next to the one I needed to pull out.

    I was so impressed I considered pulling out a few extra parts just to see if they would light up too, but managed to restrain myself before making a complete mess of things.

  • (cs) in reply to Steve Nuchia
    Steve Nuchia:
    This one is from the World Series earthquake:

    Date: 3 Nov 89 15:26:52 GMT
    From: [email protected] (Douglas Humphrey)
    To: misc.security
    Subject: Re: Earthquake

    A Tandem Computers VLX system, a fault-tolerant transaction processing system, fell over flat on its back (this is a big mainframe, maybe 6 cabs of 6 feet tall and 28 inches or so wide, and weighs a LOT).

    It was, of course, still operating just fine flat on its back. The disks were still upright, due to their being shorter and having lower centers of gravity. From what I have been told, it was uprighted by Tandem CEs and never missed a beat. They had a UPS for power, obviously...

    Doug

    Same earthquake, different machine.

    A friend of mine was putzing around with the Stratuses in the data center when the earthquake hit. Apparently, the floor rippled towards him. "Dude," he said (he has a ponytail), "that's the only time I've ever been able to surf in a data center."

    Every single machine went down hard, except for the five Stratuses, which didn't so much as blink an LED. They just kept on running for the next six or seven hours while everything around them was being patched back together.

    Of course, since they were FEPs and now had nothing operational behind them, that wasn't much damn use to anybody. Sometimes Five Nines just doesn't meet the business requirements.

  • Steve (unregistered) in reply to Pez

    Yeah, it does suck to be stupid doesn't it?

  • Steve (unregistered)

    Typical. That's probably why Tandem doesn't exist any longer. People make the mistakes and the computer gets blamed. Sure, blame the computer. It can't defend itself.

    Tandem Employee #1570 (1980-2000)

  • squigs (unregistered) in reply to sewiv

    He shouldn't have been fired. People will occasionally flip the wrong switch; they're only human. It's such a predictable mistake that procedures should be in place to handle this. They weren't. Not his fault.

    That and of course, they now have a guy who is going to be extremely careful about every switch he might flip.

  • Robert C (unregistered) in reply to Grovesy

    I worked on a Stratus system for a while - really quite nice boxes and Tandem competitors. They also went in for the duplication of components etc, hot replacement etc. Was amused to see the Stratus support engineer get egg on his face when he tried to demo how you could remove any card and it would carry on. He pulled out the one card that wasn't duplicated, machine died! Luckily for him, it wasn't in production at the time.

    Meanwhile, this sort of thing (and the customs and excise story below) is exactly what configuration management (a CMDB, as part of ITIL or service management) is about - knowing which box is running what application/service.

  • Zed (unregistered) in reply to nisl

    Tandems aren't in use by my company (largest software vendor for financial exchanges in the world) :)

  • Paul (unregistered) in reply to real_aardvark
    real_aardvark:
    Five nines translates to five minutes a year -- guaranteed under non-idiot circumstances.

    Even with Chris' mistake, they'd have been within that ballpark - if the techies at the bank hadn't boobytrapped the computer...

    Really, Chris' mistake was just an 'oops' - nothing to get fired over. A country's ATM machines going offline for 5 minutes once in 3 years is not that big a deal. Really. The fact that the system just couldn't reboot was the big issue. If anyone was going to get the blame, it should have been the people who did that (or the people who told those people not to do a test reboot of the server at 3am on a quiet day)

    Also, Tandem should have put physical interlocks in place to stop switches being turned off or modules being removed if they were the vital ones. (With a key operated 'total shutdown' switch to stop the computer from taking over the world with no way to kill it).
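
    For reference, the arithmetic behind that figure: 99.999% availability leaves 0.00001 × 365.25 × 24 × 60 ≈ 5.3 minutes of downtime per year, or roughly 16 minutes over three years. A quick flip-the-right-switch-back recovery fits inside that budget; the 24-hour outage caused by the untested startup scripts blows through it on its own.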

  • Harry (unregistered)

    Tandem used to advertise that you could shoot a bullet into their computers and they would continue to operate, and prove it by doing so.

    Some stock exchanges use their gear; they were bought out by HP a few years ago.

  • (cs) in reply to Paul
    Paul:
    real_aardvark:
    Five nines translates to five minutes a year -- guaranteed under non-idiot circumstances.

    Even with Chris' mistake, they'd have been within that ballpark - if the techies at the bank hadn't boobytrapped the computer...

    Really, Chris' mistake was just an 'oops' - nothing to get fired over. A country's ATM machines going offline for 5 minutes once in 3 years is not that big a deal. Really. The fact that the system just couldn't reboot was the big issue. If anyone was going to get the blame, it should have been the people who did that (or the people who told those people not to do a test reboot of the server at 3am on a quiet day)

    Agreed, wholeheartedly.

    I've done some terrible, terrible things to Stratuses in my time. I've put them into an endless rebooting cycle (as I mentioned in a post about a year ago) because I wanted to go down the pub; I've used them to bring down the entire Compuserve network, almost permanently (on the orders of my PHB. Sadly, humane killing is not yet legal in California); I've used a Z80 line card (one of the few things that isn't duplexed) to spray random instructions into the kernel and wipe it out; I've even watched a colleague use the Motorola level 7 interrupt button without asking him why.

    For about two months, we kept our credit-card processing system up to the latest government regs with a ten-line script I ran through the multi-process debugger.

    But I have never heard of a shop that allowed this level of insane patching on a live system. Not Stratus. Not Tandem. It's pretty much antithetical to the entire ethos.

    Chris is very definitely not to blame. At the very least, he should be complimented for giving the company a water-tight reason to sack the entire support staff.

    BTW ... If that story above about a Stratus engineer pulling out a non-duplexed board is true: well, they've gone downhill a very long way since my days.

  • Kelly (unregistered)

    Yep, these things are fun. 20 years ago I watched in awe as a company in the same building that I worked in moved their tandem from one floor to another without any user ever noticing anything. They'd take one cabinet out of the fabric, wheel it downstairs, clean it, hook it back in, run some checks, bring it up, wait 15 minutes, then go get the next one. Took three days.....

    Now, I have one where I work, and I somehow ended up in charge of it. Since we're a development shop, there's no production workload on it, so when I have a problem I just reboot ("cold load") it. If that doesn't fix it, I just call support (they have the best support, hands down, of any company I have EVER had to call for help). I have a lot of fun telling them, "hey, yeah, my Fribozz was acting up, so I cold loaded it and that didn't fix it, what next?". They used to damn near have a heart attack at the idea that I cold loaded it, but they're kind of getting used to me now...

    My CE told me a story one day about a customer in California who called support after an earthquack and said that his machine had gone down and he needed help getting it back up. They started to walk him through the startup procedure, and he said "no, no, you don't understand. It fell through the floor and I need help getting it back up on the rest of the floor. It's still running.".

    Factoid: It took me months of usenet searching and asking questions to find the right way to turn one off. Turns out it was in the manuals, but in a place that you'd never find it....

  • Tourist (unregistered) in reply to notromda
    notromda:
    I've done a similar thing with RAID drives... Somehow the order of the drives wasn't what it was supposed to be or something... but when you pull one too many drives from a RAID array, it's not too fun.

    It's now a pet peeve of mine that there's not a better feedback system throughout hardware to indicate faults. Why not have a little led on the hard drive, on the bus cable, on the motherboard... some sort of indicator to say "Hey STUPID! The bad part is right here!"

    I also had an issue once with RAID drives. I was sent to China to update a system, and the customer insisted that I do a backup before updating. However, what nobody told me before going there was that Norton Ghost didn't work well with RAID drives when doing a backup from Windows (at least that old version). No, you had to boot from a floppy to properly ghost the drives, so after some fiddling by me, the server wouldn't boot any longer. Then it went from bad to worse: when I tried to fix the problem by pulling out one of the RAID drives, the other drive got corrupted as well. Of course, it turned out the customer had not done any backup of his own, since nobody had informed him it was his responsibility. The whole plant stood still for two days. Imagine how extremely popular I was.

    Turned out the update was so small that it could have been done in 5 minutes without any need for a backup, but the customer insisted on the backup. D'oh!
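
    The wish above for an indicator that says "the bad part is right here" can at least be approximated in software on the RAID side. A minimal sketch, assuming Linux md software RAID (failed members are reported with an "(F)" marker in /proc/mdstat; hardware controllers need their own vendor tools), that lists exactly which member disks have failed before anyone starts pulling drives:

        # Sketch for Linux software RAID (md) only; the path and marker are standard md conventions.
        import re

        def failed_members(mdstat_path="/proc/mdstat"):
            """Return {array: [failed devices]} so you pull the dead disk, not its healthy twin."""
            failures = {}
            with open(mdstat_path) as f:
                for line in f:
                    m = re.match(r"^(md\d+)\s*:\s*(.*)", line)
                    if not m:
                        continue
                    array, rest = m.groups()
                    # member entries look like "sda1[0]", or "sda1[0](F)" when failed
                    bad = [dev for dev, flag in re.findall(r"(\S+?)\[\d+\](\(F\))?", rest) if flag]
                    if bad:
                        failures[array] = bad
            return failures

        if __name__ == "__main__":
            for array, devs in failed_members().items():
                print(f"{array}: failed member(s): {', '.join(devs)} -- pull these, not the others")

    It's no substitute for a fault LED next to the drive bay, but it beats guessing.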

  • mikko (unregistered)

    This has happened to us in less dramatic ways with regular Linux/Intel servers when the uptimes were 'too long'. We've had 400+ day uptimes, during which a number of new services were put on and configured on some systems. Along comes a hardware failure, or an electrical blackout compounded by some UPS problem, and the machine reboots and does not come up quite the way it was supposed to. Someone wrote an init script for some service or another and never tested rebooting on the actual system (because it was in production), with its dependencies and other services.

    Our work is mostly not that time-critical; we can afford an hour or two of downtime for debugging the booting. Problems like this just show that uptime should not be an end in itself. Not that it ever was for us, we just didn't bother rebooting. Rebooting every now and then is good for you, if not for getting in a new kernel, then at least for checking that rebooting works.
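
    One cheap partial check, short of an actual reboot: compare the services you expect to come up against the start links that are actually wired into the default runlevel. A sketch, assuming an old-style SysV layout (the service names and the rc3.d path are illustrative); it only catches services that were never hooked into boot at all, not scripts that exist but are broken, so the only complete test is still a real reboot:

        # Sketch only: assumes SysV-style /etc/rc3.d with S##name start links.
        import os

        EXPECTED = ["httpd", "postgresql", "some_inhouse_daemon"]   # illustrative names
        RC_DIR = "/etc/rc3.d"                                       # default runlevel on many older distros

        def missing_start_links(expected, rc_dir=RC_DIR):
            """Return expected services that have no S## start link and so won't start at boot."""
            starts = {name[3:] for name in os.listdir(rc_dir) if name.startswith("S")}
            return [svc for svc in expected if svc not in starts]

        if __name__ == "__main__":
            for svc in missing_start_links(EXPECTED):
                print(f"WARNING: {svc} has no start link in {RC_DIR}; it will not come back after a reboot")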

  • mikko (unregistered) in reply to Kelly
    Kelly:
    after an earthquack

    That was the mother of all quacks.

  • Bosshog (unregistered)

    From reading all these comments, and speaking here as a computer expert, I have concluded that these Tandems are absolutely A1 bad-butt, and I want one!
