• KattMan (cs)

    And that my friends is how you cage the beast!

  • I forgot my posting name (unregistered) in reply to KattMan

    Could be worse, they could have been using it as a saw horse.

  • n9ds (cs)

    That's one good reason to have plain old copper-based analog voice lines...let the phone company worry about keeping the phone lines powered. And frankly, they're pretty damned good at it.

  • tchize (cs)

    This reminds me of a storm a few years ago that brought down major power lines across a whole region of France. A network company kept its customer data lines running on generators, buying diesel every week to refill the tank at every local network hub. In this devastated part of France, some of the only electrical equipment working for nearly a month was the network's switches and routers. Of course, nobody was using them because the clients didn't have electricity, but the service was available :D

  • Someone You Know (cs)

    I hate it when I can't get bring my server back online!

  • Rugger fan (unregistered) in reply to Someone You Know

    I have heard the previous WTF calls of lame and so forth. But crap on a cracker, this one sucked.

    Boiling it down:

    1. Power failure.
    2. Power was out a long time and UPS lost reserves. (Duh!)
    3. Phone notifications did not work.

    Wow!!! Stop the presses!!! Call the media!!!

    The rest is BS fluff.

    So, this boils down to a WTF of somebody at the COMPANY did not test the system LowestBidder Inc put in for the phone system.

    Geee... I need to go change my pants I was so surprised about that conclusion.

    N.

    (yes, laden with LOTS of sarcasm)

  • WillisWasabi (unregistered) in reply to n9ds

    Yes, never, never, never put your outage paging system on a PBX! And don't use email paging, especially when your email system is down.

    You have no idea how many times I've had to explain those 2 key points to dim-witted managers.

  • Otto (unregistered)

    The end of this story is very ambiguously written.

    One reading of it indicates that the PBX's power was coming directly from the grid and not from the UPS. This is somewhat boneheaded but entirely understandable. It's also very hard to test for, so it's unsurprising that it wasn't caught.

    A different reading indicates that the PBX's phone line was not plugged in to the UPS. This is truly stupid and even a trivial test would reveal the problem. This one would be a true WTF.

    So which one is it?

  • Sgt. Preston (unregistered)

    I give... "cosmic microwave background"?

  • jimlangrunner (cs)

    Okay, gotta ask. If there was that much depending on it, why weren't the generators wired to start automatically when the power went out? UPS only lasts so long.

    I mean, what if the pole outside were hit & took the lines down. Even if the PBX were plugged in, no lines = no phones.

    But I'm a pessimist.

  • KattMan (cs) in reply to jimlangrunner
    jimlangrunner:
    Okay, gotta ask. If there was that much depending on it, why weren't the generators wired to start automatically when the power went out? UPS only lasts so long.

    I mean, what if the pole outside were hit & took the lines down. Even if the PBX were plugged in, no lines = no phones.

    But I'm a pessimist.

    A pessimist is exactly the type of person that should be responsible for these kinds of systems.

    The security head for the World Trade Center was laughed at for his tenacity in putting all the safety measures in place and actually requiring evacuation practices. When the time came, many people were saved because they knew what to do.

    Things like this you put in place to handle the worst thing you can think of, then hope you never, ever need it.

  • Anonymous Coward (unregistered)
    Solaris does not take kindly to having the lights go out unexpectedly

    Uh-oh, don't let Jörg Schilling hear that!

  • kbiel (unregistered)

    Let's add another WTF to this dog-pile: Why didn't the UPS notify the Sun system to shut down as it neared exhaustion? I refuse to believe that an expensive 8-hour UPS doesn't include a serial or USB port and software to gracefully shut down connected equipment.

  • O Thieu Choi (unregistered)

    the magic option "logging" in one's /etc/vfstab can save gobs of time running fsck...

  • Sgt. Preston (unregistered)

    "continuous membrane bioreactor"?

  • FredSaw (cs) in reply to Rugger fan
    Rugger fan:
    Wow!!! Stop the presses!!! Call the media!!!
    Um... if you can stop the presses, you are the media.
  • Sgt. Preston (unregistered)

    "community mailbox"?

  • Zygo (unregistered)

    I admit I've only seen a few PBX systems, but they all had internal batteries, and they do work during power outages (although the runtime depends on how many extensions are active, and you don't get voicemail while the line voltage is down). Maybe larger PBX systems don't work that way...?

    I once built a NOC where the notification server is a laptop configured to send SMS with a cell phone connected by bluetooth (as well as two or three more conventional notification methods using more conventional hardware). As long as the two devices are within 30 feet of each other and the chargers are plugged in, they work well (as confirmed by one test message arriving every week, and all the reports of actual power outages and server problems received over the last several years). The laptop runs for 5 hours and the phone for 53 on top of any runtime provided by an external UPS.

  • Sgt. Preston (unregistered)

    "combat medical badge"?

  • snoofle (unregistered) in reply to Sgt. Preston

    Was this disaster recovery system put into place by the same bureaucrats from the earlier post? Seems like someone missed a meeting about the phone notifications, and it got built wrong...

    Just wondering....

  • Another Anonymous Coward (unregistered) in reply to Anonymous Coward

    LMAO, you made my day!

  • sewiv (unregistered)

    Ridiculous. If you have a generator and a UPS (with 8 HOURS of runtime, which is also ridiculous), there's no excuse for not having an automatic transfer switch. Our UPSes have about an hour of runtime, and seldom use more than 10 seconds of it, since that's how long it takes for the ATS to start the generator, test that it's receiving good power from the generator, and switch over to the generator. When the power comes back, the ATS waits 15 minutes to make sure the power is stable, then switches back and shuts down the generator (after a cooling-off period).

    That's just how it's done. Anyone who installs a system that does otherwise should be fired/sued into non-existence. The ATS is the cheapest part of our setup.
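
    For anyone who hasn't seen an ATS in action, the switching rule described above can be sketched roughly like this. It is only an illustration: the function name, the arguments, and the fallback to mains when the generator is unhealthy are my assumptions, not any vendor's actual logic. Only the 15-minute stability wait comes from the comment.

```python
# Hypothetical sketch of the transfer rule; names and structure are assumed.
MAINS_STABLE_SECONDS = 15 * 60  # the 15-minute wait mentioned in the comment


def select_source(mains_ok, mains_stable_seconds, generator_ok):
    """Return which source should carry the load: 'mains', 'generator' or 'ups'."""
    # Go back to mains once it has been stable long enough, or straight away
    # if the generator cannot carry the load (an assumption of this sketch).
    if mains_ok and (mains_stable_seconds >= MAINS_STABLE_SECONDS or not generator_ok):
        return "mains"
    # Mains is down or not yet trusted: stay on the generator if it is healthy.
    if generator_ok:
        return "generator"
    # Nothing upstream is usable, so the UPS batteries are all that is left.
    return "ups"
```

    The last branch is the case the rest of this thread argues about: with mains and generator both gone, UPS runtime is the only thing that matters.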

  • Jim Bob (unregistered) in reply to Anonymous Coward
    Anonymous Coward:
    Solaris does not take kindly to having the lights go out unexpectedly

    Uh-oh, don't let Jörg Schilling hear that!

    don't try to do no thinkin' just go on with your drinkin', just have your fun you old son of a gun and drive home in your lincoln

    i'm gonna tell you the way it is, and i'm not going to be kind or easy, your whole attitude stinks i say and the life you lead is completely empty,

    you paint your head, your mind is dead, you don't even know what I just said,

    That's you american womanhood.

    You're phony on top , you're phony underneath, you lay in bed and grit your teeth

    Madge i want your body, harry get back... madge it's not merely physical, oh harry you're a beast <female crying>

    madge I couldn't help it.... awww dogg gonnit

    what's the ugliest part of your body, what's the ugliest part of your body?

    some say it's your nose, some say it's your toes, i say it's your mind

    all your children are poor unfortunate victims of systems beyond their control

  • SuperQ (unregistered) in reply to sewiv

    I was helping out a small ISP that was growing faster than their one sysadmin could handle. I stopped by one morning when the building power was supposed to be out due to power company work. We knew about the outage for weeks ahead of time. Since I was only really there to work on my colo that morning, and the ISP's admin said that he would be there, I didn't plan for much.

    6:00am: power goes out. A few seconds later, half of their datacenter goes out. Crap, the UPS was running over capacity, and tripped its internal mains. The junior admin was around, and I said, damn, that sucks, and we cranked the thing into bypass mode... but nothing happened. "Why isn't the backup generator running?"

    This is of course Januaryish in Minnesota. I run outside to the generator, and of course the cabinet to the control side of the engine isn't locked. The little toggle switch was sitting in the "Off" position instead of "Auto". Click. Vroom! We have power again.

    The real WTF of this story was that neither the ISP's owner nor the sysadmin showed up, and the sysadmin slept through the whole thing while we called him every 5 minutes.

  • Kharkov (unregistered)

    Last time we had a power outage at the local headquarters, we were surprised to see our UPS go down after less than 10 minutes when it was supposed to provide power for more than 1 hour, time enough to gracefully shut down all the servers. (The PBX system lasted on its internal batteries for 14 hours.)

    Since we are only an office, not a plant like other locations, we didn't have a generator, so all work was suspended until the electrical company restored the lines.

    After checking, we (the IT people, responsible for the UPS system) found that the local administrative manager had somehow succeeded in connecting a brand new building directly to the special power lines dedicated to the server room and the UPS...

    We managed to get some funding for a new UPS and a generator too.

  • sasha (unregistered)

    It's ironic that this was on Solaris. If they used ZFS, they wouldn't have had to worry about data loss or corruption at all.

  • Notta Noob (unregistered) in reply to sewiv
    sewiv:
    Ridiculous. If you have a generator and a UPS (with 8 HOURS of runtime, which is also ridiculous). Our UPSes have about an hour of runtime, and seldom use more than 10 seconds of it, since that's how long it takes for the ATS to start the generator, test that it's receiving good power from the generator, and switch over to the generator. When the power comes back, the ATS waits 15 minutes to make sure the power is stable, then switches back and shuts down the generator (after a cooling-off period).
    "Your faith in your [generator] will be your [downfall]." (loosely quoting RotJ) It's just super that you have an ATS that allows for a 10-second cutover, quite fantastic, really. Now what happens when you have a generator failure due to, well, pretty much pick any mechanical reason a generator can fail? Do you think an hour will give you enough time to: a) find a replacement generator; b) hire an electrician to re-wire it into your building EP grid; c) fire-up & test the replacement before cutting it over?
    sewiv:
    That's just how it's done. Anyone who installs a system that does otherwise should be fired/sued into non-existence. The ATS is the cheapest part of our setup.
    About the only part of this statement I agree with is the ATS part and related liability. I can tell you that "how it's done" is entirely dependent on your company's continuity objectives.

    As the IT Manager for a scheduled airline, my objective was to keep things running as long as possible. So yes, we had our key systems connected to the building EP grid, with a 5-second transfer to gen-power at loss of building mains. On top of that, we had a UPS monstrosity that would run all of our servers, network gear, phones, and key workstations for 8 hours. At about the 4-hour mark on the UPS, our managed power bars would start shutting off power to all non-essential peripheral equipment. At the 2-hour mark, non-essential servers would begin the shutdown process. At the 1-hour mark, only two servers, two switches, two routers, one workstation and the phone system would remain powered. At the half-hour mark, the phone system would have power removed and work off of its internal batteries (good for 2 hours) and the UPS would run down to zero.

    Overkill? Well, depends on whether you'd like to see the company monitoring your aircraft have some idea of where those aircraft are and be able to communicate with them. The process outlined above could theoretically extend the time of power delivery from 8 hours to just under 24. As for us, we never experienced an outage beyond 6 hours, but the 6 hour outage we did experience was due to, of all things, a generator failure while the building was blacked out :)

    Is it ironic or just entertaining that the captcha for this post is 'alarm'?
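
    The staged load shedding described above boils down to a lookup keyed on remaining UPS runtime. A minimal sketch, assuming the "marks" in the comment mean hours of runtime left; the names `SHED_STAGES` and `actions_due` are made up for illustration and are not taken from any real power management product:

```python
# Hypothetical table: (hours of UPS runtime remaining, action that becomes due).
# The thresholds are the ones given in the comment; everything else is assumed.
SHED_STAGES = [
    (4.0, "power off non-essential peripherals via managed power bars"),
    (2.0, "begin shutting down non-essential servers"),
    (1.0, "keep only two servers, two switches, two routers, one workstation and phones"),
    (0.5, "cut UPS power to the phone system (internal batteries take over)"),
]


def actions_due(hours_remaining):
    """Return every shedding action that should already have been triggered."""
    return [action for threshold, action in SHED_STAGES if hours_remaining <= threshold]


# With 1.5 hours left, the first two stages are due but the last two are not.
for step in actions_due(1.5):
    print(step)
```

    Each stage cuts the load, which is how an 8-hour battery could in theory be stretched toward the "just under 24" figure mentioned in the comment.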

  • complich8 (unregistered) in reply to sasha

    "That's not irony, that's just mean"

  • mikko (unregistered) in reply to kbiel
    kbiel:
    Let's add another WTF to this dog-pile: Why didn't the UPS notify the Sun system to shut down as it neared exhaustion? I refuse to believe that an expensive 8-hour UPS doesn't include a serial or USB port and software to gracefully shut down connected equipment.

    It doesn't matter if the UPS has it if nobody is allowed to use it. I once worked at a company that refused to allow the sysadmin (me) to load the UPS monitoring software onto our servers, because "there were other things to do". Along comes a power failure, and down (crash) go 72 servers after the UPS failed (the generator was inoperative). The boss started to "assign responsibility" for the damaged hardware and corrupted data to me. The result? I went and had a chat with the company legal counsel, the chief financial officer, and the company president (they were standing around in a panic), and then walked away (quit) and left my ex-boss to sort out the mess. Happily for me ('cuz I'm a nasty guy), the company lost several of their major accounts, and two quarters later closed their doors for good. Those "failure to perform" contract clauses cost them a ton of money. From what other employees later told me, they never did get all their systems back up; the ex-boss tried to use desktops to replace the damaged servers...

  • Top Cod3r (unregistered)

    Not sure why my on-topic post got deleted while the irrelevant "i'm gay/you're gay" posts remain. I guess they want to filter out any dissenting comments.

    The real WTF is this site doesn't even know what a real WTF is anymore.

    Yes, we all get it: companies have strict policies, people make mistakes. So what? Why don't you once again show us something that makes us go WTF, rather than just a bitch session about various companies' red tape, or silly error messages with typos in them.

  • Bill (unregistered)

    Sasha -- thanks for the ZFS comment. Exactly right. Solaris is sexy, but ZFS, aka IBM's z/OS File System, is superstable.

  • Mitchell T (unregistered) in reply to Bill

    So is UFS with the logging option enabled. If you have a semi-recent Solaris install, say Solaris 8 or later (which is only about 8 years old by now, mind you), this isn't an issue.

    That is one WTF right there. Patching is another one. While not mentioned, I bet the OS hasn't been patched since it was installed, which only means you can hit fun little data corruption bugs.

    I am assuming this was running UFS with a large (>100G) filesystem. I don't know if Solstice/SVM was the volume manager, but given how cheap they seem to be, VxFS/VxVM is out of the question. That is even better than UFS if you can afford it.

    And ZFS stands for the Zettabyte File System, not any of IBM's stuff. ZFS for Solaris has only ever been stable; read the source code if you don't believe me. Wait, you can't do that for the z/OS stuff. Doh!

    How much does a z/OS server run these days? Would a million even get me an intro machine?

  • db2 (cs)

    We had something slightly similar happen to us once.

    We've got an Exchange server on site. Our "plan B" involves a large mail spooling/spam filtering service that you've likely heard of. Our mail server goes down, no problem. Their machines spool it until ours comes back up, at which point it delivers the payload to us. We may lose internal correspondence abilities for a little while, but customers can send mail to us without it bouncing, and the world keeps turning.

    Likewise, we've got DNS on site. We also have redundant DNS hosted off site at our ISP. It zone transfers, and all is well.

    And then came the week(s) of server room reconfiguration. During the consolidation, some unracked machines were migrated onto racks. Our DNS server just so happened to be one such machine. It also happened to be nothing more than an average desktop machine (a WTF in and of itself). Can you guess what happens when a tech connects a 120V desktop power supply to a 240V PDU fed from a very large UPS?

    Suffice it to say, DNS was no more. But hey! We've got off site DNS! We don't need to worry about that until Monday! Well, it turns out the TTL on our zone transfers was a bit lower than expected. Let's just say that all the mail spooling contingency plans in the world don't mean dick if you can't even get an MX record from DNS in the first place.

    The moral of the story: a chain is only as strong as its weakest link. That was a fun morning of faking my way through rebuilding a DNS machine. I became experienced with it real quick. :P

  • Russ (unregistered) in reply to db2
    db2:
    We had something slightly similar happen to us once.

    We've got an Exchange server on site. Our "plan B" involves a large mail spooling/spam filtering service that you've likely heard of. Our mail server goes down, no problem. Their machines spool it until ours comes back up, at which point it delivers the payload to us. We may lose internal correspondence abilities for a little while, but customers can send mail to us without it bouncing, and the world keeps turning.

    Likewise, we've got DNS on site. We also have redundant DNS hosted off site at our ISP. It zone transfers, and all is well.

    And then came the week(s) of server room reconfiguration. During the consolidation, some unracked machines were migrated onto racks. Our DNS server just so happened to be one such machine. It also happened to be nothing more than an average desktop machine (a WTF in and of itself). Can you guess what happens when a tech connects a 120V desktop power supply to a 240V PDU fed from a very large UPS?

    Suffice it to say, DNS was no more. But hey! We've got off site DNS! We don't need to worry about that until Monday! Well, it turns out the TTL on our zone transfers was a bit lower than expected. Let's just say that all the mail spooling contingency plans in the world don't mean dick if you can't even get an MX record from DNS in the first place.

    The moral of the story: a chain is only as strong as its weakest link. That was a fun morning of faking my way through rebuilding a DNS machine. I became experienced with it real quick. :P

    I'm confused about what you mean by the TTL on the zone transfers being too low. I'm not too sure how zone transfers work, but does it mean that it just replicates the DNS info to the off-site DNS server? If the server is down, it can't replicate anything, now can it?

  • $ (unregistered)

    Today's WTF provides valuable information.

    A ton of other WTF stories have involved consultants who bill $300/hour, big famous expensive consulting companies, big famous expensive vendors, etc.

    Today's WTF proves that Lowest Bidder Inc. gets the same results as the big guys. You don't have to pay enormous fees.

  • $ (unregistered)

    To Russ:

    The DNS transfer worked. The TTL for the DNS transfer told the ISP how long to use their copies of the DNS records. When the TTL expired, the ISP obediently stopped using their copies (including the MX record).

  • operagost (cs) in reply to Notta Noob
    Notta Noob:
    "Your faith in your [generator] will be your [downfall]." (loosely quoting RotJ) It's just super that you have an ATS that allows for a 10-second cutover, quite fantastic, really. Now what happens when you have a generator failure due to, well, pretty much pick any mechanical reason a generator can fail? Do you think an hour will give you enough time to: a) find a replacement generator; b) hire an electrician to re-wire it into your building EP grid; c) fire-up & test the replacement before cutting it over?
    You can get a diesel generator installed in eight hours? Quite fantastic, really.
  • Rich (unregistered) in reply to FredSaw
    FredSaw:
    Rugger fan:
    Wow!!! Stop the presses!!! Call the media!!!
    Um... if you can stop the presses, you are the media.

    A medium, surely?

  • Tez (unregistered)

    "WorseThanFailure"? Nah, this story sounds much like just an ordinary failure.

  • Raggles (cs) in reply to Rugger fan
    Rugger fan:
    I have heard the previous WTF calls of lame and so forth. But crap on a cracker, this one sucked.

    Boiling it down:

    1. Power failure.
    2. Power was out a long time and UPS lost reserves. (Duh!)
    3. Phone notifications did not work.

    Don't forget: 4) In the end, it turned out OK 'cos the server wasn't really that busted. 5) The end.

  • probabilities (unregistered) in reply to operagost
    operagost:
    Notta Noob:
    "Your faith in your [generator] will be your [downfall]." (loosely quoting RotJ) It's just super that you have an ATS that allows for a 10-second cutover, quite fantastic, really. Now what happens when you have a generator failure due to, well, pretty much pick any mechanical reason a generator can fail? Do you think an hour will give you enough time to: a) find a replacement generator; b) hire an electrician to re-wire it into your building EP grid; c) fire-up & test the replacement before cutting it over?

    You can get a diesel generator installed in eight hours? Quite fantastic, really.

    At least you have 800% more opportunities to get it.

  • JustSomeone (unregistered) in reply to sasha
    sasha:
    It's ironic that this was on Solaris. If they used ZFS, they wouldn't have had to worry about data loss or corruption at all.

    Assuming server-grade hardware (and not toy ATA disks with write caches enabled), it should handle things just fine with other file systems, as well.

    However, if it was an oldish server with lots of disks, I can see this kind of situation causing problems. Failing to spin up after being powered down is a common failure mode in disks that were otherwise fine, and it might even happen to several simultaneously after the system has been running continuously for years.

  • Aurora (unregistered) in reply to operagost

    We can actually get one of those container sized 400 kVA generators up and running in 2 hrs. But then, we A. have those on site, B. have experience doing so and C. we ARE the electricity company.

  • poochner (cs) in reply to Aurora
    Aurora:
    ... and C. we ARE the electricity company.

    That's cheating!

  • KG2V (unregistered) in reply to n9ds

    Yep - I was going to say - that's what 1 hard wired POTS line is for. Standard disaster recovery stuff

  • valerion (cs)

    We had a power outage yesterday and the backup generator wouldn't start, resulting in some major stuff going down.

    I guess I can make the front page, too!

  • s (unregistered) in reply to SuperQ
    SuperQ:
    The real WTF of this story was that neither the ISP's owner nor the sysadmin showed up, and the sysadmin slept through the whole thing while we called him every 5 minutes.

    Feel lucky. Gateway goes out Friday afternoon. "Sorry, but the service people went home already. They will fix it on Monday".

  • Morty (unregistered) in reply to Russ
    Russ:
    db2:
    Suffice it to say, DNS was no more. But hey! We've got off site DNS! We don't need to worry about that until Monday! Well, it turns out the TTL on our zone transfers was a bit lower than expected. Let's just say that all the mail spooling contingency plans in the world don't mean dick if you can't even get an MX record from DNS in the first place.

    I'm confused about what you mean by the TTL on the zone transfers being too low. I'm not too sure how zone transfers work, but does it mean that it just replicates the DNS info to the off-site DNS server? If the server is down, it can't replicate anything, now can it?

    If a secondary DNS server transfers a zone from a primary server, it is supposed to periodically query the primary to make sure the data is still current. If the primary goes down, the secondary may continue to serve the zone data, but only for an amount of time indicated in the zone's SOA record which is called the zone expiration time. After zone expiration, the secondary is supposed to throw away the data. I think this is because the secondary cannot tell if the problem is that the primary is not allowing it to do zone transfers, in which case the secondary's data might be stale, or if it's just that the primary is down, in which case the secondary remains the best source of the data.

    "TTL" in the DNS world means something else.
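
    To make the expiration rule concrete, here is a minimal sketch of the decision a secondary makes. The class and function names are invented for illustration, and the timer values are hypothetical, not the ones from db2's zone; only the idea that the SOA expire value bounds how long a secondary keeps answering comes from the explanation above.

```python
from dataclasses import dataclass


@dataclass
class SoaTimers:
    """Timers carried in a zone's SOA record, in seconds."""
    refresh: int  # how often the secondary checks the primary for changes
    retry: int    # how soon to try again after a failed check
    expire: int   # how long the secondary may keep serving without contact


def secondary_still_serves(seconds_since_last_good_refresh, timers):
    """True while the secondary is still allowed to answer for the zone."""
    return seconds_since_last_good_refresh < timers.expire


# Hypothetical values: a two-day expire does not survive a long weekend.
timers = SoaTimers(refresh=3600, retry=600, expire=2 * 24 * 3600)
friday_evening_to_monday_morning = 62 * 3600

print(secondary_still_serves(3600, timers))                              # one hour after the primary died
print(secondary_still_serves(friday_evening_to_monday_morning, timers))  # Monday morning
```

    With numbers like these, the off-site copy answers happily on Friday night and has discarded the zone, MX record included, by Monday.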

  • Jim Bob (unregistered) in reply to Top Cod3r
    Top Cod3r:
    Not sure why my on-topic post got deleted while the irrelevant "i'm gay/you're gay" posts remain. I guess they want to filter out any dissenting comments.

    The real WTF is this site doesn't even know what a real WTF is anymore.

    Yes, we all get it: companies have strict policies, people make mistakes. So what? Why don't you once again show us something that makes us go WTF, rather than just a bitch session about various companies' red tape, or silly error messages with typos in them.

    SAND IN VAGINA

  • s (unregistered) in reply to operagost

    10 minutes for a decision from the boss, 5 minutes to find an employee with a pickup truck, half an hour's drive to the store, 5 minutes to find it and get it to the car, 10 minutes to get the invoice, another half an hour to get it back to the site, with a 5-minute stop to buy fuel. Setting up a portable generator (gasoline, not diesel) takes maybe 15 minutes. Installing it is as easy as plugging the UPS into it.

    This is all providing you get a blank cheque from the boss. Otherwise, purchase approval can take up to a month.

    It's not a hi-tech solution, but if you want it fast and improvise, you don't have time for hi-tech.
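
    Adding up that timeline, as a quick sanity check. The step labels are paraphrased, and treating the 5-minute fuel stop as extra time on top of the return drive is an assumption:

```python
# Minutes per step, taken from the estimate above.
steps = {
    "decision from the boss": 10,
    "find an employee with a pickup truck": 5,
    "drive to the store": 30,
    "find the generator and load it": 5,
    "get the invoice": 10,
    "drive back to the site": 30,
    "fuel stop": 5,  # assumed to be in addition to the drive back
    "set up the portable generator": 15,
}

total_minutes = sum(steps.values())
print(f"{total_minutes} minutes, about {total_minutes / 60:.1f} hours")
```

    That comes in under two hours, which fits comfortably inside an 8-hour UPS runtime but not inside a 1-hour one.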
