• bvs23bkv33 (unregistered)

    frist frist, just to be redundant

  • Gargravarr (unregistered)

    'Frazzled looking contractors', or quite literally frazzled?

  • RobyMcAndrew (unregistered)

    A friend of mine worked on the traffic boards for the M25. The spec said two separate sets of power and comms, one down the hard shoulder and one down the central reservation. The contractor thought "it would be much cheaper to run both sets of cables down one trench". From here the stories converge.

  • Rogan (unregistered)

    I worked at the local electrical utility for a few years, looking after their control room. As you can imagine, redundancy was the word there too! In the age of coaxial Ethernet, we had redundant networking, redundant MMI (man machine interface) workstations, redundant ADM servers, redundant COMM servers, etc., etc. All fed from redundant power (blue and red), and each of the blue and red power circuits had its own 6kVA UPS. The two 6kVA UPS's were backed up by a 16kVA UPS, which could be fed from either of two independent electrical infeeds (from two different islands!), as well as by a generator. Should be good, right? Um ...

    So, every Friday, the generator got tested. 9am, switch off the two infeeds, power on the genny, make sure it was working fine, then turn off and switch back to the primary infeed. Midnight on Friday, I get a call, everything has gone dark!

    Rush out to find that the 3 UPS's are all drained, and so everything is off. Huh? Turns out, they forgot to turn the primary infeed back on after testing the genny. The UPS's were beeping like crazy all day and into the evening, and the folks in the control room didn't say anything about it because (well, their explanation was somewhat incoherent). Eventually the batteries were drained, and everything went silent. At which point I got the call . . .

  • Rfoxmich (unregistered)

    Never underestimate the power of a backhoe to ruin the best laid plans of mice and men.

  • That one guy (unregistered)

    TRWTF is that management didn't dig in their heels and say testing and training is a waste of time.

  • radarbob (unregistered)

    "The UPS's were beeping like crazy all day and into the evening, and the folks in the control room didn't say anything about it because (well, their explanation was somewhat incoherent)."

    So these guys got fired and they all got work at Three Mile Island.

  • Bitter Like Quinine (unregistered)

    Ah yes, one place I worked used to record these as "JCB errors". I remember there were several in the time I was there.

  • dkf (nodebb)

    Did the emergency lighting still work? It's “fun” when it turns out that that also depends on main power…

  • steve taylor (google) in reply to Rogan

    In my case they had parked a big flat-screen television in front of the UPS warning panel, so nobody could hear it or see it.

  • Jeremy Hannon (google)

    I worked somewhere that had a different but somewhat similar situation with our data connections. This was one of those locations that had to be up 24/7/365. Data connections were also important, though not quite as important as power.

    Plans were made. The new data center had all of the power redundancy protections in place, the specialized cooling systems, and even a false roof to protect it from the building's otherwise non-removable sprinkler system (so that it would divert the water if the sprinklers ever went off). The new fiber lines came into the room near the same point, but exited the room and the building on different sides, going down different streets to different central offices, so that nobody digging in the wrong place could mess us up.

    Unfortunately, what we didn't know is that both of those COs routed all their traffic through a different data center nearly 40 miles away (the same one for both), and apparently the carrier hadn't done great testing of its own backup systems. There was a huge power failure and none of their generators came online. We ended up with two data connections that were fully functional with no place to route their traffic. Big telecom companies were TRWTF that day.

  • I dunno LOL ¯\(°_o)/¯ (unregistered)

    TRWTF was that this WTF was discovered before the site went live, and not two years later.

  • verisimilidude (unregistered)

    Similar story: a trading firm buys dual fiber circuits into New York so that they won't get cut off from the markets. A backhoe in New Jersey takes out both since the dual circuits were both running through the same trench.

  • jmm (unregistered) in reply to Rfoxmich

    They owed that backhoe a large load of thanks: it helped them figure out the flaws before they became critical.

  • Oliver Jones (google)

    I once worked for a place that was really proud of their two data centers, for backup and failover. In the same flood plain. With redundant power vendors with substations in the same flood plain.

    Guess what happened one spring?

  • Bobjob (unregistered)

    I remember reading about the power supply for a hospital. There was battery backup for long enough for dual generators to start up, and the generators could provide enough power to run all essential lighting and equipment while the fuel lasted. The system was commissioned, tested, and worked without fault over a long period. Then one day, after the hospital was fully operational (no pun intended), the mains power failed. The battery backup kicked in, the generators started, and all was well... for a while. Then the generators stopped, and by then the batteries were already exhausted. Fortunately, the mains power was restored quite quickly. It took quite a while to work out the cause of the problem: the pumps that fed fuel from the main fuel tanks to the smaller fuel reservoirs alongside the generators were wired into the mains. They should, of course, have been supplied from the generators.

  • Sir Ooopsies (unregistered)

    Wasn't sure what a "backhoe" was and misread it as "blackhoe" and Google it on image search at work. My advice to you, don't do it.

    P.S.: I'm glad "he were able" to figure it out. Might want to proof-read the grammar in your articles, just sayin'

  • Sir Ooopsies (unregistered)

    *Googled (and yes, I might want to proof-read the grammar in my comments before posting them, I know...)

  • DCL (unregistered)

    Guys, what's with the 24/7/365? I understand 24 hours a day. I understand 7 days a week. I'm having trouble with the 365 weeks a year. Either write 24/7/52 or 24/365. </pet peeve>

  • tldr; (unregistered) in reply to DCL

    Simpler: uptime needs to be 1/0. Downtime naturally 0/1.

  • Your Name (unregistered) in reply to DCL

    Shouldn't that be 365 weeks/month?

  • Quite (unregistered) in reply to DCL

    The trouble is that there are more than 52 weeks a year. So either one or two days a year of a 24/7/52 regime can be downtime ...
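    The arithmetic behind that quip is easy to check: 52 weeks of 7 days is 364 days, which leaves one day uncovered in a common year and two in a leap year. A throwaway Python sketch (the function name is just for illustration):

```python
# Compare a "24/7/52" schedule (52 full weeks) against the real calendar year.
DAYS_PER_WEEK = 7
WEEKS_COVERED = 52


def uncovered_days(year_length_days: int) -> int:
    """Days of the year left over after 52 full weeks of 24/7 operation."""
    return year_length_days - WEEKS_COVERED * DAYS_PER_WEEK


for label, days in (("common year", 365), ("leap year", 366)):
    print(f"{label}: {uncovered_days(days)} day(s) not covered by 24/7/52")
```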

  • Kashim (unregistered)

    So the lesson from the article and all of the comments: there is always a single point of failure. Even if that point is a nuclear warhead going off in orbit that EMPs all of your sites, there is always a single point of failure.

  • This happend a year ago in belgium (unregistered)

    http://deredactie.be/cm/vrtnieuws.english/News/1.2355065

    TL;DR: when the mains power went out, a UPS and a diesel generator set kicked in. A miswiring in the generator set not only made the critical machines fall silent, but silenced them forever: huge power spikes left the machines broken...

  • Ex-lurker (unregistered) in reply to Quite

    Are we supposed to assume then that in leap years there can be a day of downtime?

  • CodeSlave (nodebb)

    Now now now... it's not called a backhoe. Everyone knows it's a "hydraulic cable finder".

  • FuuzyFoo (unregistered) in reply to DCL

    24/7/365.2425 really...

  • David Nuttall (google) in reply to Jeremy Hannon

    So your manager got to talk with your ISP about the penalty clause of their uptime guarantee.

  • David Nuttall (google)

    This is what is called an unintended single point of failure, when all redundant systems can be taken out in a single action. Thankfully it was identified prior to going live. BTW, grounding 2 huge power feeds through your excavation equipment is not recommended. Not good for the backhoe and not good for the power supply equipment.
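    One way to make that definition concrete, as a sketch only: describe each "redundant" path by the set of physical things it depends on, then intersect the sets. Anything that shows up in every path is an unintended single point of failure. The feed and component names below are hypothetical, loosely modeled on the shared-trench stories in this thread.

```python
# Hypothetical example: each "redundant" feed is described by the physical
# components it passes through. All names are invented for illustration.
feeds: dict[str, set[str]] = {
    "power_feed_A": {"substation_1", "trench_north", "switchgear_A"},
    "power_feed_B": {"substation_2", "trench_north", "switchgear_B"},
}


def single_points_of_failure(paths: dict[str, set[str]]) -> set[str]:
    """Return the components that every path depends on.

    Losing any one of these takes out all of the redundant paths at once.
    """
    if not paths:
        return set()
    # set.intersection with several arguments returns a new set.
    return set.intersection(*paths.values())


print(single_points_of_failure(feeds))
```

    The check can only find dependencies that someone actually wrote down; in several of the stories here (a shared remote data center, fuel pumps on mains power) the shared component was exactly the one nobody had listed.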

  • David Nuttall (google) in reply to FuuzyFoo

    If you want to get picky, it is closer to 365.2422 days/year. We will need to correct our algorithm for leap year placement in about 3000 years.
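    The numbers in this sub-thread can be checked directly. The Gregorian rule (a leap year every 4 years, skipping century years not divisible by 400) gives 97 leap days per 400 years, a mean of 365.2425 days. Against the 365.2422-day figure quoted above, the calendar gains about 0.0003 days per year, or one day in roughly 3,300 years, which is in line with the "about 3000 years" estimate. A small Python sketch using that 365.2422 figure as an approximation:

```python
from fractions import Fraction


def is_gregorian_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries, except every 400th year."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Count leap days over one full 400-year Gregorian cycle.
leap_days_per_cycle = sum(is_gregorian_leap_year(y) for y in range(2000, 2400))
mean_year = Fraction(400 * 365 + leap_days_per_cycle, 400)

# Approximate tropical year length, as quoted in the comment above.
TROPICAL_YEAR = Fraction("365.2422")

drift_per_year = mean_year - TROPICAL_YEAR  # days the calendar gains per year
years_per_day_of_drift = 1 / drift_per_year

print(float(mean_year))
print(round(float(years_per_day_of_drift)))
```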

  • Foobar (unregistered)

    I once worked at a healthcare organization that was (obviously) very concerned about downtime. Like basically every EMR of the era, ours used a client/server model. Now, the actual EMR servers were on redundant and backup power supplies, etc., as you would expect. However, it turned out that nobody had ever actually tested the system. At the first power outage, we had the sickening realization that, while the EMR servers were all up and humming along great, the other set of servers, the ones hosting the app virtualization that delivered the EMR client, were not on any sort of redundant power. So while technically the EMR server was "up", nobody had a client to actually access it with.

    That whole comedy of errors continued as they redid the data center to put both the EMR and XenApp servers on the same redundant circuit. They got points for remembering to include the climate controls of both server rooms, so we didn't have to worry about the servers roasting themselves alive. However, someone really should have done the math for the total current draw of all devices now on this circuit compared to the current rating of the circuit itself.
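    "Doing the math" here is just summing the expected draw and comparing it with what the circuit can carry. A rough sketch follows; the device list, per-device amps, and the 60 A rating are all invented for illustration, and the 80% continuous-load margin is a common rule of thumb (for example, the US electrical code's continuous-load guideline), not anything from this story.

```python
# Hypothetical inventory: names, counts and per-device current draw (amps)
# are made up; real figures would come from nameplates or metering.
devices = [
    ("EMR server", 4, 4.5),
    ("XenApp server", 10, 3.8),
    ("network switch", 4, 1.2),
    ("cooling unit", 2, 16.0),
]

CIRCUIT_RATING_AMPS = 60.0  # hypothetical breaker rating
# Rule of thumb: keep continuous load at or below 80% of the circuit rating.
CONTINUOUS_LOAD_FACTOR = 0.8


def total_draw(inventory):
    """Sum count * per-device amps across the inventory."""
    return sum(count * amps for _name, count, amps in inventory)


load = total_draw(devices)
limit = CIRCUIT_RATING_AMPS * CONTINUOUS_LOAD_FACTOR
print(f"total draw {load:.1f} A vs. continuous limit {limit:.1f} A")
if load > limit:
    print("overloaded: split the load or upgrade the circuit")
```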

  • racekarl (unregistered)

    Back in 2007 when the Internet2 was new (and still newsworthy) it was knocked offline when a homeless man lit a mattress on fire underneath the Longfellow bridge between Boston and Cambridge, MA and melted the fiber cables.

  • L (unregistered)

    I dig this story.

  • Paul (unregistered)

    At work we have a data center with a big UPS, a big generator, a big (and full, and often refreshed) fuel tank, the works...

    It always passed all tests, both the two-yearly end-to-end test where they cut the mains (they usually found a few places where the startup current in some offices was high enough to trip the breaker, but never anything major or unexpected) and the monthly generator test (which always passed flawlessly).

    That is, until there was an actual outage; the generator wouldn't start... As it turns out, the starter battery for the generator had gone bad since the last end-to-end test, but combined with its charger there was just enough juice to get the generator to start... Of course, without mains there was no charger to boost the plastic cube containing lead and acid into something resembling a functional starter battery, and therefore no running generator...

  • urkerab (nodebb)

    I think the story of a generator that only started thanks to its battery charger has been done before.

  • Mike (unregistered)

    When we were getting ready to install our servers in a new data center, they proudly showed us their redundant systems. They had room-sized Caterpillar diesel generators and a battery backup room that looked like it belonged in a diesel-electric sub.

    When we set up our servers, they asked why our racks had internal UPSes, given the data center backups. We always wanted some of our redundancy to be based on something we controlled.

    Sure enough, a few years later, the data center lost power, the generators didn't start, and the battery backups caught fire. Our racks were the only ones unaffected, entirely thanks to our onboard UPS systems.

  • ray10k (unregistered)

    UPS in the loft in case of flooding. So if there's a flood, bugger the expensive mainframe and other equipment, just so long as our UPS is safe.

    These fictional WTFs are terrible.

Leave a comment on “Who Backs Up The Backup?”
