• (cs)

    When all logical steps fail, reboot.

    If that fails, wipe & reload.

    And if THAT fails...

  • (cs)

    Should there not be a label somewhere on the box that says "MISSION CRITICAL SERVER" or something? Shouldn't the test server be labeled as such? The real WTF is that these people don't know what labels are.

  • (cs)

    It would have been even funnier if the hardware guy had formatted the drive, thinking it was just a test server, and tried to reconfigure the whole thing. Then a whole day of trades would have been lost because somebody didn't label the server and the "critical" server got wiped out. And if they didn't have a backup CD on hand....

  • (cs)

    That's a pretty epic WTF.

  • DKleinsc (unregistered)

    And we all learn a valuable lesson: Put clear labels on all servers, and possibly even post a map of the servers in each cabinet on the cabinet's door.

  • Keaton (unregistered)

    Hilarious. And far, far, far too commonplace.

  • Robert Hanson (unregistered)

    This is a mission-critical server, and they don't have a hot standby? If the server fails for ANY reason, it should automatically fail over to the standby. (I think we all know what needs to be done for the standby - separate power source, separate network to a separate backbone, along with a third hot standby located halfway across the country...)

    Relying on any single piece of equipment to operate your business is foolhardy at best. I hope these "upper management" types never have to explain to the board of directors that the computer died and so we lost a whole day's revenue.

  • Anon (unregistered)

    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.

  • (cs) in reply to Anon
    Anon:
    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.

    This is an investment in leverage. The ol' "YOU ARE MY B*TCH" scenario. I can only imagine that there's some benefit, or that Michael is some uber "nice person (tm)".

  • (cs) in reply to Robert Hanson
    Robert Hanson:
    This is a mission-critical server, and they don't have a hot standby? If the server fails for ANY reason, it should automatically fail over to the standby. (I think we all know what needs to be done for the standby - separate power source, separate network to a separate backbone, along with a third hot standby located halfway across the country...)

    Of course, then you end up with the old "separate power sources plugged into the same power strip" problem. I've seen it. It made me laugh.

  • whicker (unregistered)

    Too many real WTFs in this story.

    If Michael is so important, what is he doing living an hour away? Or rather, an hour's drive at 2 AM away...?

  • (cs)

    I suspect the anonymization ruined this story. Either that or every test/diagnostic was effectively worthless.

    I trust they invested in a monitoring solution after this incident.

    Note from Alex: Like most, this story wasn't anonymized beyond redacting key details (location, names, etc). The only things that get really "anonymized" are the "big systems" and code samples -- they're too specific and far too recognizable to a lot of people, so I have no choice but to change them. This story is, sadly, universal!

  • Sgt. Preston (unregistered) in reply to Anon
    Anon:
    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.
    <stereotype> Probably a six-pack of jolt and a dozen Mars bars. </stereotype>
  • (cs) in reply to Sgt. Preston
    Sgt. Preston:
    Anon:
    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.
    <stereotype> Probably a six-pack of jolt and a dozen Mars bars. </stereotype>

    In general, by the time you've graduated to being the 'last resort' at 2AM for a large company, you've also graduated from caffeine during the day to massive amounts of alcohol at night.

  • Michael (unregistered) in reply to nerdydeeds
    nerdydeeds:
    Should there not be a label somewhere on the box that says "MISSION CRITICAL SERVER" or something? Shouldn't the test server be labeled as such? The real WTF is that these people don't know what labels are.
    No, the test server was very clearly labeled "Orange".
  • stupid old me (unregistered) in reply to BiggBru
    BiggBru:
    When all logical steps fail, reboot.

    If that fails, wipe & reload.

    And if THAT fails...

    You mean those aren't the first steps? :-)

  • last resort (unregistered) in reply to s0be
    s0be:
    Sgt. Preston:
    Anon:
    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.
    <stereotype> Probably a six-pack of jolt and a dozen Mars bars. </stereotype>

    In general, by the time you've graduated to being the 'last resort' at 2AM for a large company, you've also graduated from caffeine during the day to massive amounts of alcohol at night.

    There are some days I wouldn't be able to handle massive amounts of alcohol at night if I didn't have the massive amounts of caffeine during the day...

  • user (unregistered) in reply to nerdydeeds
    nerdydeeds:
    Should there not be a label somewhere on the box that says "MISSION CRITICAL SERVER" or something? Shouldn't the test server be labeled as such? The real WTF is that these people don't know what labels are.

    Two words: Orange Cable

    CAPTCHA: onomatopoeia (WTF?!)

  • (cs) in reply to Anon
    Anon:
    The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.

    I'd say Michael got some pretty good credit out of the situation. Even though he said "we just rebooted it", management will still know that when he came in, the problem magically fixed itself.

    They think he's got some magic touch. Or maybe there was some complex fix but he knew he was so smart that they'd never understand, so he didn't even try. Look at that modest guy, saving the company and acting like it's nothing. Let's give him a raise. ...Not like that punk hardware engineer who just follows people around and nods looking guilty.

    On second thought.... this probably just means he'll get more calls at 2AM.

  • john (unregistered)

    Who'd have thought it possible for an entire physical server to be 410...

  • GrandmasterB (unregistered)

    Here's how the story would have gone if it was me...

    2 am: rrriiinnngggggg

    me: <unplugs phone>

    Seriously, what kind of tool goes into the office at 2am? F- that.

  • (cs) in reply to nerdydeeds
    nerdydeeds:
    Should there not be a label somewhere on the box that says "MISSION CRITICAL SERVER" or something? Shouldn't the test server be labeled as such? The real WTF is that these people don't know what labels are.

    Label?!?!? If it's mission critical to your business and in control of your customers' money, it should be locked behind several doors that require everyone's authority to open (or at least to power down).

    And why is there only one server for the "mission critical" function? And I'd assume it's on something <facetious>stable like Windows NT</facetious>.

    Bah!

  • waffles (unregistered) in reply to GrandmasterB
    GrandmasterB:
    Here's how the story would have gone if it was me...

    2 am: rrriiinnngggggg

    me: <unplugs phone>

    Seriously, what kind of tool goes into the office at 2am? F- that.

    The kind that's paid several hundred dollars an hour to be available at 2am.

  • Jerim (unregistered)

    The best thing about the story is that he covered for an engineer who made an honest mistake. That is the type of guy I think we would all like to work for. Of course, the second time that happens, I would hang the engineer out to dry.

  • foo (unregistered) in reply to waffles
    waffles:
    GrandmasterB:
    Here's how the story would have gone if it was me...

    2 am: rrriiinnngggggg

    me: <unplugs phone>

    Seriously, what kind of tool goes into the office at 2am? F- that.

    The kind that's paid several hundred dollars an hour to be available at 2am.

    True. My buddy gets paid extra to be the one on call 24/7. If he can't fix it, then I get called.

  • Chris Harmon (unregistered)

    Sadly too realistic... but so true. Now if only I were paid that much $$ when I had to do it...

    I had that setup at a prior contract, and until I came on, most of the servers were unlabeled like that. Glorious stuff.

    I recall having an "incident" once that was vaguely similar - I was installing a (not-so-critical) data collection workstation in a power plant's control room, and somehow a cable connected to a critical data collection workstation right below came loose. Lord knows why it wasn't actually screwed on, but the operators lost access to part of the control system. Boy, did I get reamed over it. Luckily for me, the guy who did the implementation happened to be on-site that day and knew exactly what the cause would be.

    Yeah and he had the maintenance team actually screw down the cable (and check all other workstations in the control rooms) afterwards too :)

  • dolo54 (unregistered) in reply to GrandmasterB
    GrandmasterB:
    Here's how the story would have gone if it was me...

    2 am: rrriiinnngggggg

    me: <unplugs phone>

    Seriously, what kind of tool goes into the office at 2am? F- that.

    Bud Bundy, is that you?

  • Been There (unregistered) in reply to Jerim
    Jerim:
    The best thing about the story is that he covered for an engineer who made an honest mistake. That is the type of guy I think we would all like to work for. Of course, the second time that happens, I would hang the engineer out to dry.
    I've covered for coworkers who have made mistakes like that (be honest, who among us hasn't done "rm -rf *" from the wrong directory?) - the first time.

    Of course, if there wasn't a backup (server) already in place, I made it my business to inform the upper brass, at 3AM, whether they were at the office, or in bed, that they had better spring for $$$ or they would be getting more 3AM "alerts" - and it invariably worked, too!

  • Dave C. (unregistered) in reply to nerdydeeds

    In fact, there was a label, but it simply said "Orange cable."

  • Leaky Abstractions (unregistered) in reply to john
    john:
    Who'd have thought it possible for an entire physical server to be 410...

    The first clue that my friend in Chicago got that his apartment had been broken into was a "host not found" error connecting to his home computer, which eventually was revealed to mean the host had been stolen. Yes, he had backups.

  • (cs) in reply to s0be
    s0be:
    In general, by the time you've graduated to being the 'last resort' at 2AM for a large company, you've also graduated from caffeine during the day to massive amounts of alcohol at night.
    I ended up doing both - shitloads of coffee all the way through the day, and a couple of pints on the way home to counteract the coffee. Self-medication for recalcitrant corporate drones, part 7. :)
  • (cs)

    I've been there, but usually not so bad. I remember one time working around an AS/400, trying to figure out the communications lines. I finally worked out which lines went to which offices, saw an "extra" line up, thought "hmm, must just be an old configuration," and disabled it.

    Five minutes later I get a phone call: "This is the Atlanta office. Our connection just went down." "Oh, okay, let me check it." Reactivated it. "Thanks, it's up now. What was wrong?" "Not sure, just power-cycled it." Hung up. Then walked out of my office: "Why didn't anyone tell me we had an office in Atlanta?"

  • (cs) in reply to Serpardum
    Serpardum:
    [...] Then walked out of my office, "Why didn't anyone tell me we had an office in Atlanta?"

    Classic.

  • Joe Van Dyk (unregistered)

    A monitoring system like Nagios that does heartbeats on all the servers would've saved the day here.

    "Oh, look, the production server isn't responding to anything."

  • Anonymous Coward (unregistered) in reply to GrandmasterB
    GrandmasterB:
    Here's how the story would have gone if it was me...

    2 am: rrriiinnngggggg

    me: <unplugs phone>

    Seriously, what kind of tool goes into the office at 2am? F- that.

    The story would have been most boring if it had gone that way.

  • Smart Tech Guy (unregistered) in reply to nerdydeeds

    They probably should have used little orange labels that stated 'not an orange cable but a production server'.

  • Anonymous (unregistered)

    "And attempted to resynching the connection."

  • DBGuy (unregistered) in reply to OneMHz

    The company I used to work for required two separate power sources for each server. Even some servers we had ordered with only one power supply, because they were not mission critical and utterly expendable, were retrofitted with a second power supply just to be on the safe side.

    One day, I get paged because all servers were gone from the net. All of them. I drive to the datacenter...strange. All of the servers are reporting a power supply failure, but are still operating.

    Turned out that none of the switches had a second power supply, because "you can always have them plug the servers into the other switch if one fails". Too bad that in the meantime all of the ports on "the other switch" were already in use, and management had never approved the budget for the remote hands service provided by the datacenter personnel, so one of us had to drive 100 kilometers to fix the problem.

  • Mark B (unregistered)

    Does anyone else think it's weird that they didn't have a failover system?

    Where I used to work we had overkill failover: multiple failover clusters and a disaster recovery cluster at a remote location in case the building was bombed. Still, it was a fund management company.

  • test dummy (unregistered)

    Agreed, the missing label is clearly the WTF here. My former employer had a pretty nasty server room with all kinds of boxes everywhere, but they were all labeled. Even the most temporary test box.

  • Stefano Grevi (unregistered)

    Michael just saved the butt of his hardware engineer.

  • (cs) in reply to whicker
    whicker:
    Too many real WTFs in this story.

    If Michael is so important, what is he doing living an hour away? Or rather, an hour's drive at 2 AM away...?

    We are most likely talking about New York here. The office would presumably be in the financial district at the southern tip of Manhattan. You just don't find housing less than an hour from there. It exists, but it's either already occupied or costs seven-plus digits.

  • (cs)

    We label all of our servers with a project name, project number, and hostname. We don't need labels that say something like "PRODUCTION!" or "This server is SUPER DUPER important". There's an assumption that before you disconnect a server you had better know what it's doing and who is using it, OR you had better find out. Of course, our hostnames indicate whether it is a test, dev, integration, model office, or production server anyway.

  • Charlie (unregistered) in reply to OneMHz
    OneMHz:
    Of course, then you end up with the old "separate power sources plugged into the same power strip" problem. I've seen it. It made me laugh.

    Indeed!

    I ran into a good one a while back with a bunch of co-located boxes, with a complete redundant network stack to ensure nothing could go wrong. The hosting company put all the network gear into the same cabinet. And put absolutely everything behind the same circuit breaker.

    Needless to say, after a minor power glitch took out every single server at the same time we (a) shouted very loudly at their technical guys, and (b) rapidly arranged to move the whole goddamn lot to a different facility.

    Moral of the story is, different companies have different definitions of the word "professional". Don't entrust mission-critical systems to a third party unless you get the chance to double-check everything yourself.

  • O&APartyRock (unregistered) in reply to pitchingchris
    pitchingchris:
    It would have been even funnier if the hardware guy had formatted the drive, thinking it was just a test server, and tried to reconfigure the whole thing. Then a whole day of trades would have been lost because somebody didn't label the server and the "critical" server got wiped out. And if they didn't have a backup CD on hand....

    That happened at a client I was working at. One of their tech guys formatted the source control box. In itself that wouldn't have been disastrous, other than the fact that they didn't have any real backup procedures in place. Eventually, after about a week without source control, the same tech managed to recover all the data on the box. Then he had the audacity to be upset that he didn't get praise for recovering the data he had destroyed.

  • Corporate Cog (unregistered)

    It seems unlikely that the hardware engineer wouldn't have realized, the minute the problem was detected, that he had removed the production server rather than the test server.

  • mrs_helm (unregistered)

    The REAL WTF is that the hardware engineer had removed the machine before he left for the day (as evidenced by the fact that Michael had to call him back IN), but nobody NOTICED until 2AM. On a "mission critical" system. That means if the hardware eng left at 5pm, it was 9 hrs later. Heck, even if he was working until 10pm that night, it was 4 hrs later...which is pretty bad...

  • Ollie (unregistered) in reply to Jerim

    It was nice of the guy to cover for the hardware engineer's screwup. But, the real WTF is an organization that's so punitive that it's necessary to cover for these mistakes. You KNOW that hardware engineer will never make that kind of mistake again (unless he's dumber than a bucket of gravel).

    So, fire him for the screwup, and let your competitor reap the benefit of his hard-won experience. Yeah. That's it.

  • (cs) in reply to mrs_helm
    mrs_helm:
    The REAL WTF is that the hardware engineer had removed the machine before he left for the day (as evidenced by the fact that Michael had to call him back IN), but nobody NOTICED until 2AM. On a "mission critical" system. That means if the hardware eng left at 5pm, it was 9 hrs later. Heck, even if he was working until 10pm that night, it was 4 hrs later...which is pretty bad...

    On the other hand, this is about the stock market; possibly no one uses the server when the exchanges aren't open.

  • UH OH (unregistered)

    First-time mistakes get overlooked... Damn I wish I worked for such an organization... The scope of damage the hardware engineer could have done versus saving his job for being a dumbass...

    So what if the server was unlabeled and unsecured? It's not like the server was placed in the middle of a sidewalk where everyone could take a jab at it. He's supposedly an ENGINEER; he should have known better. "Err, I, err, isn't this the test server?" stfu. Give him a labeled server and he still would have moved it. He's lost all credibility. He made a grave mistake; it's not like spilling coffee on your desk.

    Ignorance is not an excuse.

    "Fvck, who knew that big red unlabeled button was the nuke missile launcher? Not my fault, it was not labeled."
