• Jason Roelofs (unregistered)

    WTF?!

    I hope that HVAC company got a bill for all the destroyed hardware.

    Wow.
     

  • Philbert Desanex (unregistered)

    Server Technician:  "We thought it would be easier to work on both nodes of the cluster at once."

  • Evan M. (unregistered)

    Just so I get this straight: Phil is a developer who was part of the dev team that created / updates / etc. the system, and who also submitted the story. And then we have Mark, who is actually the protagonist in our tepid tale?

     And I think the real WTF here is that there weren't any service agreements with the hosting building about how access to these systems / site is controlled for maintenance purposes. Or if there were, who had their head chewed off for this major failure to follow the proper procedures.
     

  • Daniel (unregistered) in reply to Jason Roelofs
    Anonymous:

    WTF?!

    I hope that HVAC company got a bill for all the destroyed hardware.

    Wow.
     

     

    Not to mention the lost income.

     

    Captcha: tango, as in whiskey, tango, foxtrot
     

  • ammoQ (cs)

    Reminds me of when all the servers and their backup failed - during business hours - because an electrician accidentally flipped the protective switch in the server room. After that, the company learned that their "highly-available" system doesn't really like it when both servers (production and standby) go down (and restart) concurrently, so they spent another couple of hours getting the system up again.

  • obediah (cs)

    nice work there.


    Reminds me of my last job where we found a physical plant guy sitting on top of our STK silo to muck about in the ceiling. While he was arguing with my boss about how he "weren't hurtin' nuthin'", all I could imagine was that robot (which spins at quite a clip), and how cool it would be to see Asimov's first law broken. 


    The moral of the story ended up being, it doesn't matter what gauge steel something is made out of, you don't sit on top of a $1M+ piece of very delicate machinery to change lightbulbs or whatever.


     

  • captcha: hacker (unregistered)

    WTF? How did the HVAC guys get the automatic secure doors propped open without anyone knowing about it?

  • R.Flowers (cs)

    Personally, I was rooting for the giant robot.

    Hey, give the HVAC guys a break. They're interested in making their jobs easier, not making your job easier!

     ; )
     

  • Toger (cs) in reply to Evan M.

    So all the machines capable of performing these time-sensitive operations were hosted at the same location? For the level of redundancy listed they should have had a second datacenter.

    AND, who let the AC guys in and left them there without supervision to a bank's machine room? Why weren't there machine room temperature monitors, or machine-overheat alerts?

  • captcha=pizza (unregistered)
    Alex Papadimoulis:

    ...which could only mean one thing: a bomb, a fire, or a giant robot wreaking havoc throughout the city.

    How can you simply rule out a T-Rex, King Kong or even Mothra? Come on now, get real!

  • doc0tis (unregistered) in reply to R.Flowers

    Hey, give the HVAC guys a break. They're interested in making their jobs easier, not making your job easier!

    Unfortunately this is true, and they may not have known any better. The bank probably should have had some IT staff onsite.

     

    --doc0tis 

  • been there, endured that (unregistered) in reply to Toger
    Toger:

    So all the machines capable of performing these time-sensitive operations were hosted at the same location? For the level of redundancy listed they should have had a second datacenter.

    AND, who let the AC guys in and left them there without supervision to a bank's machine room? Why weren't there machine room temperature monitors, or machine-overheat alerts?

    Hey, give the AC guys some credit - they could just as easily have coordinated taking down both data centers simultaneously.

  • Unklegwar (unregistered) in reply to doc0tis
    Anonymous:

    Hey, give the HVAC guys a break. They're interested in making their jobs easier, not making your job easier!

    Unfortunately this is true, and they may not have known any better. The bank probably should have had some IT staff onsite.

     

    --doc0tis 

    In this day and age (i.e., the last 30 years), how can any HVAC contractor that does commercial work NOT know the implications of working on the AC in a computer server room? 

     

    CAPTCHA: paul

     

  • stupid things we once did (unregistered) in reply to been there, endured that

    I worked at a place where the on-site data center (used for development only, not UAT, prod or DR) was perceived as super-critical, and had to be kept up and climate controlled at all times. To this end, they had this MASSIVE (15 feet long, 3 feet deep, 8 feet high) heater/A-C/[de]humidifier that was WAY oversized for the room. Then, just in case it broke down, there was another one on the other side of the room. Every once in a while, on a Friday afternoon, we'd turn both units on (we could override the auto-controls), set the temp to the minimum, and override the humidifiers to run at full blast, and leave both running that way over the weekend. If it was cold enough outside, the inner building temp got down to about 55 on the weekend (full shutdown), so the computer room would actually have a layer of frost on everything by Monday morning. Everything was shielded, so nothing ever got damaged from the water, but it made for a great indoor snowball fight.

  • Kevin (unregistered) in reply to Evan M.
    Anonymous:

     And I think the real WTF here is that there weren't any service agreements with the hosting building about how access to these systems / site is controlled for maintenance purposes. Or if there were, who had their head chewed off for this major failure to follow the proper procedures.

    I think the real WTF is that a company would keep their backup servers in the same room as the real servers.  What if a circuit breaker reset and cut power to the entire room?  There are countless things that could happen within a single room to cause both the main and backup systems to fail.

  • mjan (unregistered) in reply to stupid things we once did

    Anonymous:
    I worked at a place where the on-site data center (used for development only, not UAT, prod or DR) was perceived as super-critical, and had to be kept up and climate controlled at all times. To this end, they had this MASSIVE (15 feet long, 3 feet deep, 8 feet high) heater/A-C/[de]humidifier that was WAY oversized for the room. Then, just in case it broke down, there was another one on the other side of the room. Every once in a while, on a Friday afternoon, we'd turn both units on (we could override the auto-controls), set the temp to the minimum, and override the humidifiers to run at full blast, and leave both running that way over the weekend. If it was cold enough outside, the inner building temp got down to about 55 on the weekend (full shutdown), so the computer room would actually have a layer of frost on everything by Monday morning. Everything was shielded, so nothing ever got damaged from the water, but it made for a great indoor snowball fight.

     Was this place in northern VA, perchance?  If so, I think I participated in one of those snowball fights.
     

  • rmg66 (unregistered) in reply to stupid things we once did

    Anonymous:
    I worked at a place where the on-site data center (used for development only, not UAT, prod or DR) was perceived as super-critical, and had to be kept up and climate controlled at all times. To this end, they had this MASSIVE (15 feet long, 3 feet deep, 8 feet high) heater/A-C/[de]humidifier that was WAY oversized for the room. Then, just in case it broke down, there was another one on the other side of the room. Every once in a while, on a Friday afternoon, we'd turn both units on (we could override the auto-controls), set the temp to the minimum, and override the humidifiers to run at full blast, and leave both running that way over the weekend. If it was cold enough outside, the inner building temp got down to about 55 on the weekend (full shutdown), so the computer room would actually have a layer of frost on everything by Monday morning. Everything was shielded, so nothing ever got damaged from the water, but it made for a great indoor snowball fight.

    ARE YOU FREAKIN' CRAZY!!! 

     

  • Phil (unregistered) in reply to Toger
    Toger:

    So all the machines capable of performing these time-sensitive operations were hosted at the same location? For the level of redundancy listed they should have had a second datacenter.

    AND, who let the AC guys in and left them there without supervision to a bank's machine room? Why weren't there machine room temperature monitors, or machine-overheat alerts?

    Actually we did have a second data center but development could not be done there.  And the AC guys were bank employees too, they just reported up a WAY different hierarchy, which, by the way, was the group that owned the security doors, etc.  The temperature monitor alerts were some of the early ones Mark got on his way downtown.  They usually just indicated an overheated box, not a complete meltdown. 

  • Steeldragon (cs) in reply to stupid things we once did

    Ok...this story is just 2 funny...a couple of critical servers are accidentally put offline because 2 AC guys work on the 2 ACs in the room at the same time...just hilarious...

  • XH (unregistered)

    Sometimes I think people get what is coming to them.  


    Q: Why did the server farm monitoring solution not recognize the sustained component temperature and alert somebody before servers started failing completely?  

    A:  Absent, incomplete, or misconfigured monitoring solution.

    Q:  Why were the servers not instructed to gracefully shut down just before the point at which component damage occurs?  It is ALWAYS less costly to have a server gracefully remove itself from service with a two minute warning, but remain bootable, than it is to have a server fail in the red zone and require physical or software maintenance to restore.

    A:  Over-reliance on the monitoring system or misconfigured servers.  In a story about backups for backups, it is odd that the monitoring system did not have a backup.
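    The graceful-shutdown policy that second Q&A argues for can be sketched roughly like this; the thresholds, the Linux /sys/class/thermal paths, and the two-minute warning are all illustrative assumptions, not details from the story:

    ```python
    # Hypothetical last-resort thermal watchdog: warn, then gracefully
    # shut down before component damage. Paths and thresholds are
    # assumptions for the sketch.
    import glob
    import subprocess
    import time

    WARN_C, CRIT_C = 75.0, 90.0   # assumed thresholds, degrees Celsius

    def max_cpu_temp_c():
        """Highest reading across Linux thermal zones (stored in millidegrees C)."""
        temps = []
        for zone in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
            with open(zone) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        return max(temps) if temps else 0.0

    def decide(temp_c, warn=WARN_C, crit=CRIT_C):
        """Pure policy: returns 'ok', 'warn', or 'shutdown'."""
        if temp_c >= crit:
            return "shutdown"
        if temp_c >= warn:
            return "warn"
        return "ok"

    def watchdog(poll_s=30):
        while True:
            action = decide(max_cpu_temp_c())
            if action == "shutdown":
                subprocess.run(["wall", "Thermal critical: shutting down in 2 minutes"])
                subprocess.run(["shutdown", "-h", "+2"])  # graceful, box stays bootable
                return
            if action == "warn":
                subprocess.run(["wall", "Thermal warning: check machine-room cooling"])
            time.sleep(poll_s)
    ```

    The point of keeping `decide()` a pure function is exactly the "backup for the monitoring system" idea: each box can run this locally even if the central monitoring solution is down.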

    Any sufficiently large system that can't be brought back up within three hours is improperly managed and configured.  At one point in time, I worked for a company that had a larger Linux cluster than Google (early Google, mind you) and we could go from catastrophic power failure to serving customers in less than an hour.  You see, in a system (power grid) with a backup (diesel + batteries), it doesn't hurt to have an additional backup (good "return to service" planning).

    One place I worked, a large educational publisher, a day-shift sys-admin just retired with 20 years of service.  The CTO (an adulterous, weak-hearted, ex-Enron weenie, but that's not important) boasted that the sys-admin had worked countless nights bringing the in-house servers back online during various power outages over the years (CPS power in San Antonio goes out just about every time we get a good storm).  I remember thinking to myself, "If the sys-admin were not so damned incompetent, he wouldn't have had to work all those nights in the first place!"

  • lankester (cs)

    Very nice WTF and we learned one thing here ... no redundancy can save a system from human stupidity ....

     

     

  • Who wants to know (unregistered) in reply to doc0tis

    NAW!  Some guys with IQs over 10 would have been nice!   If they REALLY had NO idea of the importance, and didn't care, why didn't they just turn the air conditioners off!?!?  After all, it would have saved a LOT of money, and been quieter, huh?  But SERIOUSLY, network people should have been there to supervise, and the HVAC guys should have had a BASIC understanding of network wiring/cooling/plumbing, the NOC room, AND electricity.

    Steve

  • GoatCheez (cs)

    BAAAAAAAAAAAAAAAAAAAHAHAHAHAHAHAHAHA!!! BAAAAAAAAAAAAAAAAAAAAHAHAHAHAH!!!!

    Good stuff, good stuff.... was worried that it wouldn't be a good one cuz of the hidden network dealy.... looks like I didn't need to be!

     

    I'm surprised though that the admins didn't know the maintenance schedule... with security like that I'd assume that an admin would need to be there while they were working so that he could let them in and to make sure they didn't f0x anything up.... Hindsight I guess lol ;-P

  • Noah Yetter (unregistered)

    I worked for the IT dept. at my college for 4 years and saw a lot of similar displays of incompetence from the facilities staff (which does not report to the college, long story).  The best one involves a partial campus power outage that hit the building with the primary data center.  Should be no problem, we have a huge industrial UPS with plenty of time to power down the servers before the battery runs out.  Of course we should have had them networked to the UPS so they could do this themselves, but the IT dept. was nearly as incompetent as facilities.  At any rate, we could not shut down the servers because the data center had a key-card lock that was on the main power circuit!  Naturally no one in IT had the key to the actual door-lock, which was apparently only in the possession of one member of the facilities staff who despite being based out of the same building was nowhere to be found until the batteries had nearly run out.  Fun times.

  • Howdy JO (unregistered) in reply to XH
    Anonymous:
    Sometimes I think people get what is coming to them.  

    Q: Why did the server farm monitoring solution not recognize the sustained component temperature and alert somebody before servers started failing completely?  

    A:  Absent, incomplete, or misconfigured monitoring solution.

    Q:  Why were the servers not instructed to gracefully shut down just before the point at which component damage occurs?  It is ALWAYS less costly to have a server gracefully remove itself from service with a two minute warning, but remain bootable, than it is to have a server fail in the red zone and require physical or software maintenance to restore.

    A:  Over-reliance on the monitoring system or misconfigured servers.  In a story about backups for backups, it is odd that the monitoring system did not have a backup.

    Any sufficiently large system that can't be brought back up within three hours is improperly managed and configured.  At one point in time, I worked for a company that had a larger linux cluster than google (early google, mind you) and we could go from catastrophic power failure to serving customers in less than an hour.  You see, in a system (power grid) with a backup (diesel + batteries), it doesn't hurt to have an additional backup (good "return to service" planning).

    One place I worked, a large educational publisher, a day-shift sys-admin just retired with 20 years of service.  The CTO (an adulterous, weak-hearted, ex-Enron weenie, but that's not important) boasted that the sys-admin had worked countless nights bringing the in-house servers back online during various power outages over the years (CPS power in San Antonio goes out just about every time we get a good storm).  I remember thinking to myself, "If the sys-admin were not so damned incompetent, he wouldn't have had to work all those nights in the first place!"

     

    Harcourt, eh?

  • its me (cs)

    Ok, yeah the HVAC guys were idiots and probably didn't follow procedure, but then again they're HVAC guys. They don't know squat about servers and computer rooms, so while they should get blasted for not following procedures (I assume there were procedures....) the real WTF here is such a critical system is completely contained within the same room/building/city/state.... Any company needing true redundancy and failover HAS to have at least two datacenters, preferably not even in the same state.... My company's primary and backup datacenters are 200 miles apart, and that makes me nervous sometimes.... Can we say Regional Disaster anyone?

    Another issue is why didn't the servers gracefully shut down when they became overheated? This whole situation stinks of poor planning, poor procedures, and poor testing of failover....

    -Me 

  • Dazed (unregistered) in reply to Phil
    Anonymous:
    And the AC guys were bank employees too, they just reported up a WAY different hierarchy, which, by the way, was the group that owned the security doors, etc.

    Now that puts the story a notch or two up the WTF scale. IME very few financial/administrative organisations have their own AC / heating / lighting / plumbing people these days - they're all hired in. On the few occasions when they do have their own staff, it's because they are specially trained and screened so that they can do this sort of work without supervision. To have their own staff cause a meltdown is a worthy WTF.

  • Savaticus (unregistered)

    You would think that if this company spent so much money on replication and redundancy that they would do site replication.

     

    Captcha: stfu, yup prolly right.

  • Reed (unregistered) in reply to its me

    its me:
    This whole situation stinks of poor planning, poor procedures, and poor testing of failover....

     

    The only thing worse than poor procedures and planning.... is complete lack of procedures and planning.... 

  • Joker (unregistered) in reply to Unklegwar

    I'll bet the bank went with the lowest bidder when it came to hiring HVAC guys (or the hosting facility, which in turn hired the cheapest HVAC guys, or whatever).

     You get what you pay for.

  • Peter (unregistered)

    This is why you put your backup systems in a different location than the main ones.  (Then even a "giant robot wreaking havoc" wouldn't bring the bank down!)

  • tster (cs) in reply to XH
    Anonymous:
    Sometimes I think people get what is coming to them.  

    Q: Why did the server farm monitoring solution not recognize the sustained component temperature and alert somebody before servers started failing completely?  

    A:  Absent, incomplete, or misconfigured monitoring solution.

    Q:  Why were the servers not instructed to gracefully shut down just before the point at which component damage occurs?  It is ALWAYS less costly to have a server gracefully remove itself from service with a two minute warning, but remain bootable, than it is to have a server fail in the red zone and require physical or software maintenance to restore.

    A:  Over-reliance on the monitoring system or misconfigured servers.  In a story about backups for backups, it is odd that the monitoring system did not have a backup.

    Any sufficiently large system that can't be brought back up within three hours is improperly managed and configured.  At one point in time, I worked for a company that had a larger linux cluster than google (early google, mind you) and we could go from catastrophic power failure to serving customers in less than an hour.  You see, in a system (power grid) with a backup (diesel + batteries), it doesn't hurt to have an additional backup (good "return to service" planning).

    One place I worked, a large educational publisher, a day-shift sys-admin just retired with 20 years of service.  The CTO (an adulterous, weak-hearted, ex-Enron weenie, but that's not important) boasted that the sys-admin had worked countless nights bringing the in-house servers back online during various power outages over the years (CPS power in San Antonio goes out just about every time we get a good storm).  I remember thinking to myself, "If the sys-admin were not so damned incompetent, he wouldn't have had to work all those nights in the first place!"

     

    Not all systems are as easy to boot up as a Linux box.  Hell, not all systems are as easy to boot up as 100 Linux boxes, especially if a component inside has failed.  I know of a company that works with my company that had a part fail in one of their production storage devices, and it took them over 20 hours to bring it back online.  Granted, this is extreme (the average case is 2 hours), but it happens. 

     

    This is why you put your backup systems in a different location than the main ones.  (Then even a "giant robot wreaking havoc" wouldn't bring the bank down!)

    depends on how giant.
     

  • Scot Boyd (unregistered) in reply to captcha: hacker
    Anonymous:
    WTF? How did the HVAC guys get the automatic secure doors propped open without anyone knowing about it?
    Heck, in the few secure environments I've worked in, the HVAC guys wouldn't even have access to the server room.
  • Milkshake (unregistered) in reply to stupid things we once did

    Anonymous:
    I worked at a place where the on-site data center (used for development only, not UAT, prod or DR) was perceived as super-critical, and had to be kept up and climate controlled at all times. To this end, they had this MASSIVE (15 feet long, 3 feet deep, 8 feet high) heater/A-C/[de]humidifier that was WAY oversized for the room. Then, just in case it broke down, there was another one on the other side of the room. Every once in a while, on a Friday afternoon, we'd turn both units on (we could override the auto-controls), set the temp to the minimum, and override the humidifiers to run at full blast, and leave both running that way over the weekend. If it was cold enough outside, the inner building temp got down to about 55 on the weekend (full shutdown), so the computer room would actually have a layer of frost on everything by Monday morning. Everything was shielded, so nothing ever got damaged from the water, but it made for a great indoor snowball fight.

    Paula,

    Please, tell your new co-workers about this site.

  • foxyshadis (cs) in reply to Noah Yetter

    Anonymous:
    I worked for the IT dept. at my college for 4 years and saw a lot of similar displays of incompetence from the facilities staff (which does not report to the college, long story).  The best one involves a partial campus power outage that hit the building with the primary data center.  Should be no problem, we have a huge industrial UPS with plenty of time to power down the servers before the battery runs out.  Of course we should have had them networked to the UPS so they could do this themselves, but the IT dept. was nearly as incompetent as facilities.  At any rate, we could not shut down the servers because the data center had a key-card lock that was on the main power circuit!  Naturally no one in IT had the key to the actual door-lock, which was apparently only in the possession of one member of the facilities staff who despite being based out of the same building was nowhere to be found until the batteries had nearly run out.  Fun times.

    Windows servers at least since NT (some service pack) can initiate automatic shutdowns when UPS power levels get near critical. Various flavors of Unix have the same functionality.
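    On the Unix side, that hook can be a tiny script; here's a hedged sketch assuming a Network UPS Tools setup, where the UPS name `myups@localhost` and the 20% floor are my assumptions, not anything from the comment:

    ```python
    # Sketch: poll a NUT-managed UPS and halt before the battery dies.
    # Assumes the `upsc` client is installed and a UPS named "myups"
    # is configured; both are illustrative assumptions.
    import subprocess

    LOW_BATTERY_PCT = 20  # assumed floor

    def battery_charge_pct(ups="myups@localhost"):
        """Ask the NUT daemon for the battery charge percentage."""
        out = subprocess.run(["upsc", ups, "battery.charge"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())

    def on_battery_should_halt(charge_pct, on_battery, floor=LOW_BATTERY_PCT):
        """Pure policy: halt only when running on battery AND below the floor."""
        return bool(on_battery and charge_pct <= floor)

    def maybe_halt(on_battery):
        if on_battery_should_halt(battery_charge_pct(), on_battery):
            subprocess.run(["shutdown", "-h", "now"])
    ```

    In practice NUT's own `upsmon` daemon does this for you; the sketch just makes the decision logic visible.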

     Anyway, I'm sure most sysadmins have had this story once or twice; this is my job, and I know it well. One day, during the long preparations to move our building, I walked into the server room and immediately started cursing: it was around 110F in there and the temperature monitors were all blinking. The secretary had let the AC guys remove the AC a week early to place it in the new office without notifying me, and in its place was some humidifier that was dumping the heat right back into the room. Somehow, even though she sits next to the room and had left it open to vent, she hadn't decided that dumping heat into the room was anything to worry me about.

     In the end we had to keep using a mini-unit (fixing the heat leakage) and turn the building AC down with the doors propped open, because the AC guys refused to switch it back, so I shut down all but two servers (having mostly virtual servers helps) and pulled disks out of the SAN to keep it cooler. Arrgh. Good plans can't really budget for stupidity, although at least a couple of servers are designed to go down on excessive heat now.

  • biziclop (cs)

    I once saw a large server hosting facility go down for a whole day because some imbecile installed a smaller fuse (300 A instead of 500) in the main PDU, and although everything was highly redundant, the effect cascaded through all the other circuits.

  • Jethris (unregistered) in reply to Howdy JO

    I worked in an Air Force data center, monitoring mission-critical satellites.  We had redundant systems, and a fail-over generator in case of power outages.

    Then an HVAC guy who was in there looked at the ominous "Emergency Power Off" switch and told me that it was disconnected.  I doubted him, as I had just been given the safety briefing.  He said, "It doesn't work, watch," and proceeded to hit the button, instantly dropping all servers, external drives, 9-track tape units, lights, etc.

    Yeah, he got in major trouble over that one.

  • Solo (unregistered)

    I work for a cute little 25 employee company (now just a branch of a 25,000 enterprisey company, but same tiny office of 25 drones), with only 1 (one) onsite net admin and 1 (one) onsite dev dude (=me) and 5 servers, in a tiny server room, with its own AC. We've only got two keys to the server room. One for me, one for the admin. And no, the master key to all the closets, offices and doors does not work on that door.

    This is a good example of security by the book. I'm sure access to the server room was well restricted to authorized personnel. Except that authorized personnel includes the numbskulls from maintenance, the security company, the concierge troll, the cleaning crew, and probably 25 other people.

    captcha: enterprisey, indeed!
     

  • biziclop (cs) in reply to Kevin
    Anonymous:
    Anonymous:

     And I think the real WTF here is that there weren't any service agreements with the hosting building about how access to these systems / site is controlled for maintenance purposes. Or if there were, who had their head chewed off for this major failure to follow the proper procedures.

    I think the real WTF is that a company would keep their backup servers in the same room as the real servers.  What if a circuit breaker reset and cut power to the entire room?  There are countless things that could happen within a single room to cause both the main and backup systems to fail.

    Normally you use several power sources and a PDU. (It prevents power shortage, if you calculate the capacity of your fuses precisely.:))  Nevertheless, co-location is the solution. 

  • rmg66 (unregistered)

    I worked for a very prominent non-profit organization that specializes in disaster readiness. Unfortunately management didn't have the money slash inclination to make sure their own data centers were disaster-ready.

     Two incidents, same root problem: they wouldn't spring for an independent cooling system for their data center.

    First incident: during the Northeast US multi-state power outage we had a couple of summers back, we of course had diesel backup power. Luckily the data center was connected to it. Unfortunately the central cooling system was not. I spent 3 days with propped-open doors and fans galore to make sure we didn't overheat too badly.

    Incident two: coming in one Monday morning, I walked into a server room that was at least 130 degrees. Everything was hot to the touch. Seems the central cooling system had thrown a fuse and shut down Saturday night. Nobody notified anyone in IT. To make it worse, IT reported to the Director of Administration, who was also in charge of facilities. Nobody notified him either. Another round of fans and propped-open doors ensued...

     

  • Mr Ascii (cs) in reply to Jethris

    We had a tape drive service tech hit the Emergency Power Off switch on the way out the door thinking he was unlocking a magnetic door lock. The switch got a cover the next week.

  • cj (unregistered) in reply to Reed

    its me:
    The only thing worse than poor procedures and planning.... is complete lack of procedures and planning.... 

    I'm not sure about that.

    Where I work, we have very little in the way of procedures or planning, but we know that we don't have anything useful.  Hence, we're working to get procedures into place for when a failure/disaster/etc. does occur.  Knowing how things are here, I doubt that the procedures will ever be maintained once they are finalized.

    What's better, not having anything and knowing that you don't, or relying on something so outdated that it is no longer of use?

     

    captcha: null

  • pjsson (cs)
    Alex Papadimoulis:
    ...One summer day, at about two in the morning... HVAC Guy: It didn't fail. We're just changing the chiller bars and doing some other preventive maintenance.

    Who does preventive maintenance at 2AM? I think we have a bullshit story here.

  • fly2 (unregistered)

    At a company I once worked for, we had a failure of some servers and the entire telephone system. The reason was that the network admins had turned off the AC whilst doing maintenance work in the server room because it was too cold. You know, it was summertime, the office building didn't have AC (only the server room) and therefore got quite hot, so everyone was dressed accordingly (no dress code in the .com bubble era). Then they forgot to turn it back on. Ooops :)

    A similar story from a colleague at a different company: here the server admins had their workplaces right next to the server room and in summer left the door open so that the AC would cool their (unconditioned) room as well. Unfortunately the AC wasn't up to the task, overheated, and consequently failed.

  • fly2 (unregistered) in reply to pjsson

    pjsson:
    Alex Papadimoulis:
    ...One summer day, at about two in the morning... HVAC Guy: It didn't fail. We're just changing the chiller bars and doing some other preventive maintenance.

    Who does preventive maintenance at 2AM? I think we have a bullshit story here.

    I agree. At least the 2am thing is probably an exaggeration.

  • rmg66 (unregistered) in reply to fly2
    Anonymous:

    pjsson:
    Alex Papadimoulis:
    ...One summer day, at about two in the morning... HVAC Guy: It didn't fail. We're just changing the chiller bars and doing some other preventive maintenance.
    Who does preventive maintenance at 2AM? I think we have a bullshit story here.

    I agree. At least the 2am thing is probably exaggeration 

    I would imagine there are a lot of companies who specialize in off-peak-hours maintenance. Do you really want a bunch of guys working with ladders and such while you are trying to do your job?

  • Jimbo (unregistered)

    The sad part about this is that I work for a company that does about the same thing, and just today I found out that one of my processes had stopped functioning last night without alerting anyone (although thankfully the fault fell directly to my boss's son).  The downtime cost our company 16 thousand dollars, due to being down between 2 am and noon when I got in to check on it.  :(


    To put this loss into perspective, I'm straight out of college and making 27k a year.  The loss would've been over 7 months' pay...

  • Jessica (unregistered) in reply to XH
    Anonymous:
    Sometimes I think people get what is coming to them.  

    Q: Why did the server farm monitoring solution not recognize the sustained component temperature and alert somebody before servers started failing completely?  

    A:  Absent, incomplete, or misconfigured monitoring solution.

    Q:  Why were the servers not instructed to gracefully shut down just before the point at which component damage occurs?  It is ALWAYS less costly to have a server gracefully remove itself from service with a two minute warning, but remain bootable, than it is to have a server fail in the red zone and require physical or software maintenance to restore.

    A:  Over-reliance on the monitoring system or misconfigured servers.  In a story about backups for backups, it is odd that the monitoring system did not have a backup.
     
    Seriously.  We used to have problems with people coming into our dev/qa server closet and shutting off the air conditioner because it was "too noisy", and we now have temperature monitors that warn us when the temp rises and shut down the systems if a specified threshold of time and temp is reached.  And we have access control and procedures with security about things like "phoning the sysadmins if they register any alarms in the room".
    It's not foolproof, and we have had moderate A/C issues before, but nothing on this scale, even in our poor man's excuse for a data center.
     
    I shudder at the idea that a high-availability production data center would be so lax.  Mission-critical high availability equipment should be in a data center with access control procedures and change control procedures that also includes facilities.  Or even more than one, as some have pointed out.
     
    I would never expect the facilities people to "get it".  Our facilities manager once claimed that shutting off all the A/C units in our server closets for 48 hours wouldn't require shutting down any equipment, because it was October, so we wouldn't really need A/C anyway.  After all, it would only be 50F outside!  Why would we possibly need air conditioning units for our server rooms?
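    The "threshold of time and temp" rule Jessica mentions can be sketched as a tiny state machine that only trips after the temperature has stayed over the limit for a sustained window, so brief spikes (a door opened, a sensor blip) don't power anything down. The 35-degree limit and five-minute hold below are illustrative assumptions:

    ```python
    # Sketch of a sustained over-temperature trip: fires only when the
    # reading has been above the limit continuously for hold_s seconds.
    # Limits are assumptions for illustration.

    class SustainedTempTrip:
        def __init__(self, limit_c=35.0, hold_s=300):
            self.limit_c = limit_c
            self.hold_s = hold_s
            self.over_since = None  # timestamp when temp first exceeded the limit

        def update(self, temp_c, now_s):
            """Feed one reading; returns True once the trip condition holds."""
            if temp_c < self.limit_c:
                self.over_since = None  # cooled off, reset the clock
                return False
            if self.over_since is None:
                self.over_since = now_s
            return now_s - self.over_since >= self.hold_s
    ```

    The reset-on-cooling behavior is the point: the alarm clock starts over whenever the room dips back under the limit, which is what distinguishes "sustained" from a single hot sample.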
     
  • Saladin (cs) in reply to rmg66

    Anonymous:
    I would imagine there are a lot of companies who specialize in off-peak-hours maintenance. Do you really want a bunch of guys working with ladders and such while you are trying to do your job?

    Exactly.  Having maintenance do their stuff during off-hours actually sounds like a pretty good idea.  Better than taking the A/C offline during business hours, you know?

  • monster (unregistered) in reply to XH

    I remember thinking to my self, "If the sys-admin were not so damned incompetent, he wouldn't have had to work all those nights in the first place!"

    Now that's a little unfair. Or at least it might be, since I don't know that company. If that guy tried to build a better system but the CTO/CEO/whoever didn't approve the spending for various kinds of hardware, or he was forced to give other stuff priority over making the system better, or if he was simply swamped because he was the only sys-admin for way-too-many boxes & applications & users... well, in that case I'd say it wasn't his fault.

Leave a comment on “Keepin' It Cool”
