Admin
You've not talked to your HVAC guys have you?
Admin
Data center personnel are frequently retarded.
I had a client once who decided that instead of hiring a system administrator through my company, he would instead host his server at a full-service web host that provided hardware and automatic O/S updates. Once he found a plain old random web host who promised what he wanted (without any apparent intention or ability to actually deliver it), he moved the server there, and I started getting phone calls every month demanding to know what I was doing to the server. The server, it seems, was going down on the first of every month and becoming unreachable - so we had to call the data center and ask them to cycle the power. The data center swore all up and down that nothing was going on with the system.
Eventually, when the client threatened to sue my company for damaging his business, I put my sysadmin on the case to find the problem. Upon investigation, we found that this web host had a script that ran on the first of every month to download the latest updates to the operating system, and it began by booting the system down into single-user mode where the update script would run. (Not being much of a UNIX sysadmin, I don't know how common it really is for an update script to boot down into single-user before updating anything, but it seems like that should be a WTF in and of itself.) This update script was broken, and failed before reaching the part where it should re-init() the system into multiuser mode... so every first of the month at midnight, this wonderful full-service web host was putting our server into single-user mode and leaving it there.
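A minimal sketch of the failure mode described above, and one way to guard against it: the return to multi-user mode sits in a `finally` block, so it runs even when the update step fails. The three callables (`enter_single_user`, `apply_updates`, `return_to_multi_user`) are hypothetical stand-ins, not the web host's actual script.

```python
def run_monthly_update(enter_single_user, apply_updates, return_to_multi_user):
    """Run an update in single-user mode and always restore multi-user mode.

    All three arguments are hypothetical callables standing in for whatever
    the host's real script did; nothing here is taken from that script.
    Returns True if the update succeeded, False if it raised.
    """
    enter_single_user()
    try:
        apply_updates()
        return True
    except Exception as exc:
        # Report the failure, but do not leave the box stranded.
        print(f"update failed: {exc}")
        return False
    finally:
        # Runs on success and on failure alike.
        return_to_multi_user()


if __name__ == "__main__":
    events = []

    def broken_update():
        # Simulated failure, like the broken script in the story.
        raise RuntimeError("package download failed")

    ok = run_monthly_update(
        lambda: events.append("single-user"),
        broken_update,
        lambda: events.append("multi-user"),
    )
    print(ok, events)
```

With this shape, a broken update still costs you the update, but not the server.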
Admin
At my university, our logs reported tons of shutdowns and restarts on the servers. It was the electricians testing the lights, forcing something to break so it could be replaced. (We were poor and had no UPS.)
Admin
That's just inviting Murphy over to fully kick your ass. At least you were bastardly enough to let him demonstrate it.
Admin
come on, with all the redundancy that place had, surely they had a backup backup server that monitored the air conditioners
Admin
My two A/C WTFs:
1) A common datacenter on the UPS, aircon not on the UPS with no thermal shutdown trick. When I came in to work on Monday there were a lot of techs running around like headless chickens, and it was hot hot hot.
2) The mysteriously cursed SAN disk array.
20 * 10krpm disks in a cabinet. Every 10 minutes or so a disk would stop for a minute or so then come back up. Then the cycle would repeat, with a different disk every time.
Turned out that with 20 disks running the temp in the cabinet would creep up above the thermal shutdown threshold of the disks, and with only 19 running the temp would drop down again. The disk that rested would always be the coolest (although it would have catching up to do). Took ages to work that one out - we assumed it was a firmware bug. The solution was to gaffa tape a 120mm fan on the intake vent; there just wasn't quite enough airflow to keep the whole cabinet at (server) room temp.
Admin
Reminds me of a school I used to work at. There was a dedicated server room which had a split system AC.
One year I come back after the xmas break, open the server room door and get hit by a heatwave. What had happened was the outside part of the AC had an off switch and someone had turned it off during the holidays!
This switch is now out of reach of children :)
Admin
Plenty of people. If something goes tits-up in your maintenance and it takes down the farm, would you rather it at 2am and lose a couple hundred thousand dollars in productivity, or at 2pm and lose tens of millions?
Admin
20 years ago, we had a Radio Shack robot in the data center (tiny shop). It would monitor power, temp, and noise, calling a preprogrammed list of numbers until it got someone who would enter a code to say "OK already, I'll come deal with it". Modal failure was that someone would leave the cover up on the printer (1403) and the noise would set it off; if you picked up the call when it did so, it would say, "NOISE LEVEL IS HIGH, LISTEN FOR 15 SECONDS" and you'd usually hear people saying "What is that?" "Oh, crap, we set the robot off again" and suchlike.
You could also call the robot and it would tell you the current temp etc.
But the really weird thing was that the phone it was connected to -- the only hard-wired outside line in the place -- would ring several times a day. We never managed to get to it in time (or if we did, the caller would just hang up). It was WAY too frequent for random wrong numbers.
I didn't figure it out until I was driving home at 2AM one night and heard an ad for a suicide hotline. When they got to the end and announced the phone number, I almost drove off the road laughing: it was the same number in the adjacent area code (this is the DC area, so it's right next door). So here's some poor bastard, at the end of his rope, calls the (wrong) number, and gets a robot saying, "THE POWER IS OK...THE TEMPERATURE IS OK...THE NOISE LEVEL IS OK..."!
As a colleague observed, "At least it wasn't Nike headquarters, where the ACD presumably says 'Just do it'..."
...phsiii
Admin
Uh yeah. I was told a similar story. There was a super-secured server with a perfect UPS and its warm-backup, running fine for a long time. Then one day it stopped. Cuz the janitor pulled the plug of *both* main and backup from the socket in order to power the vacuum-cleaner.
Also one of my critical systems failed once: the main computer was installed in its chief operator's room. Then he went on vacation for two weeks, the doors and windows were closed, the air conditioner was turned off, and some really hot and sunny days came. Then the processor overheated...
Admin
Some time ago, the servers at our colo that did most of our hosting all went down pretty much simultaneously. Like, completely dead, unresponsive, kaput. We were quickly notified of a fire emergency that had happened; luckily, though, the cooling system kicked in and there probably wasn't too much damage that our hardware had sustained.
Well, this unfolded like a soap opera, until we found out the truth. There had been no fire. They were testing the cooling system. Well, the test sure worked. The temperature of the room had gone down by about 40 degrees in less than a minute. Most of our hard drives did not work after that.
We have a new primary colo.
Admin
I had a professor in one of my programming classes in college who ran the online portion of a neighboring college. The online system originally started small and didn't need any heavy duty cooling systems, but eventually as online classes became more popular at the college the number of servers grew. Eventually there were enough servers to justify buying a dedicated AC for the system and when it arrived it was installed on the same circuit as the servers themselves. Needless to say, this circuit wasn't exactly made to handle the load of a heavy duty AC plus 50 or so servers and the breakers popped and the system crashed, taking out a number of servers with it. The HVAC dudes later realized that the building had a number of high load circuits put in specifically for systems like air conditioners. I think it's time we started questioning the intelligence of HVAC technicians.
Admin
I worked in IT for the security department of a major university. We were moved to new facilities before we were allowed to test the backup systems. We finally tested them 6 months after we moved in. We had a UPS and backup generator.
When we attempted to go from UPS to generator we blew the whole electrical system because the breakers were make-before-break and it went bang. We could not get into the server room; they had forgotten to put batteries in the access control system power supplies. The keys were in electronic key safes, controlled by one of the dead servers. The air conditioning was not on UPS, and the generator was not big enough to run it either. We were down for nearly five hours. That was 2 years ago. Last I heard the UPS had not been repaired and the circuit breaker was still make-before-break. The servers in this room control the perimeter doors of 240 multi-story buildings on 5 campuses and all the CCTV and alarm systems.
I now live in another state, and watch the news for a story about the ransacking of a whole university one hot summer night...
Admin
What do you think all those critical notification alerts were??
I'm a robot. and i like those easy captchas. :D But i got no feet :(
Admin
OMG! Two serious WTFs in ONE thread. These are good times... :D
But thank god i'm not your boss. If i found out, I would have locked the units so you could not disable them, then use a group of 30 overly strong men with black suits and sunglasses to take away all your clothes, locked the room and kill any communication connections with an axe from the OUTSIDE of the building, and finally tell you that to stay alive your only chance were to get *very* near to each other!
GWAHAHA AHA HAHA.....
captcha: pizza (yeah! dream of it while eating snow and body fluids!!!)
Admin
One of the multiple WTFs here is that "the automatic secure doors were propped open", apparently without any alarm sounding or security personnel arriving.
Admin
Did you know that Google codesearch beta offers a wtf:brillant search option?
Admin
so why did he get the day off? I would guess it was in sympathy but the way it was written kinda makes it seem like a bad thing
Admin
I also came in one Monday to find the AC down and the server room hot as hell. In our file server, the plastic casing of the disks had MELTED. The front panel of the server case, also plastic, had melted too and looked like a Dali painting.
I don't know how, but half of the disks were still working and we didn't even lose the RAID array. We had to replace all of them the following week as they failed one after the other, but we didn't even have to restore the backups.
And no, the moral of the story is not that backups are useless ;)
Admin
A co-worker of mine visited a customer's data center which had stairs from the door down to a recessed floor. The stairs ran along the door's wall, all the way to the adjacent wall (i.e. a corner). In that corner, facing the stairs, was the Emergency Power Off switch. She watched an employee come into the data center, start down the stairs, and trip. His hands went out to catch himself, slamming into the Emergency Power Off switch, with the predictable results.
Next time she visited, they'd put a little cage over the switch ...
Admin
The real WTF here is "which could only mean one thing: a bomb, a fire, or a giant robot wreaking havoc throughout the city". Somebody needs to re-read the definitions of "only" and/or "one".
Admin
I'm sure everyone has one or two of these. Mine is the redundant servers with redundant halon systems, redundant UPS, redundant A/C, all of which overheated and failed when the single thermostat pooped out. Or the data center with everything completely redundant including servers, power, A/C, halon, AND dual, separate, dedicated phone lines carrying the same data to the other centers, compared and checksummed at each end, both of which were knocked off the building by a single backhoe.
Admin
Reminds me of a server room at a client's site (yeah yeah, spare me the consultant jobs).
They had a failure of the A/C in their room over a weekend, so the temperature rose slowly. Thankfully, someone had wired a cutoff switch on the power feed to the server room to prevent a meltdown. When the temperature reached a certain level, power went out. Cool.
The server room had UPS backup batteries in case of power failures. So naturally those kicked in. Batteries don't last, so the servers shut down.
Temperature went back down. The cut off switch was deactivated, the servers went back up. All is great, except there's still no A/C.
Repeat, until your UPS batteries wear down and cutting off the power just kills the servers right away, every 5-10 minutes or so.
Wonder how many drives they'd have lost (aside from the 1-2 they did) if some ahem consultant hadn't worked over the weekend to open the server room door.
CAPTCHA: batman
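The power-cycling loop in that story comes from a cutoff that re-enables itself as soon as the room cools. A minimal sketch of the alternative, a latching cutoff that stays tripped until a human resets it, might look like this. The class name `LatchingCutoff` and the 40 C threshold are invented for illustration; nothing is known about the client's actual switch.

```python
class LatchingCutoff:
    """Thermal cutoff that stays tripped until someone resets it by hand."""

    def __init__(self, trip_temp_c):
        self.trip_temp_c = trip_temp_c  # assumed threshold, degrees Celsius
        self.tripped = False

    def update(self, temp_c):
        """Feed a temperature reading; return True if power may stay on."""
        if temp_c >= self.trip_temp_c:
            self.tripped = True
        # Once tripped, cooling down alone does not restore power.
        return not self.tripped

    def manual_reset(self):
        """Called by a person after the A/C has actually been fixed."""
        self.tripped = False


cutoff = LatchingCutoff(trip_temp_c=40)
for reading in (35, 41, 30):
    print(reading, cutoff.update(reading))
```

The third reading is back under the threshold, but power stays off, which is the point: the servers do not come back up into a room that still has no A/C.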
Admin
You ever notice that cockiness and ignorance seem to be strongly correlated in some people? Man, I hate those people.
Admin
Note to self: Want to gain access to highly secure and sensitive banking data? Dress up as HVAC guy.
WTF 1: How did the HVAC people gain access w/o the people responsible for maintaining security/uptime knowing?
WTF 2: How is it that the HVAC guys have access unescorted to a room full of sensitive banking data?
WTF 3: How is it that there are no SLAs that protect/inform the HVAC guys that keepin' the servers cool is a big deal?
WTF 4: How is it that an HVAC guy sees two redundant HVAC units designed to cool a server room and doesn't think... servers must be kept cool... that's why there are redundant HVACs?
WTF 5: How is it that a financial institution (ie: large bank) that spent unknown dollars on clusters to milk every dime out of 24/7 financial transactions doesn't have a spanning cluster? (geographically distributed cluster)
Admin
At which point on the next visit by your co-worker, she started down the stairs, and tripped, and her hands were ripped off by the new little cage over the switch?
Admin
I was a dev, not an admin, so don't blame me :P
And it also gets much worse than just that, if we're talking about the same place. Like walking into the server room and stepping into 1" of standing water on the floor because nobody had emptied the dehumidifier/AC condensation collection buckets for a week. Buckets, of course, because while management had planned out the perfect server room with millions of dollars worth of hardware, nobody had thought to consult a plumber.
Of course, even this was an improvement over the previous arrangement, consisting of multiple servers connected to power strips daisychained into wall outlets.
Admin
War stories all around!
Story 1: In my last job, we had installed an instructional lab in a room which was designed & built prior to our moving in. Result - insufficient power to the room. An instructor came to teach a class (and had brought his own hardware), and when everything was turned on it tripped the breaker.
My buddy went to the panel to reset the circuit, and was trying to figure out which switch it was, when instructor guy "helpfully" pushed his way in and started flipping all of them. He made it a few switches in before buddy o' mine managed to hip check him.
All of our servers (and most of our desktops) were on battery backups, so no harm there. Unfortunately, he also got to the circuit for the compressors on our HVAC system. The air conditioning units didn't like having their power cycled so quickly, and failed to come back online.
Story 2: Landlord hires roofers to do some work above our ops room. It's near the end of the day and they're not done, so the foreman squints at the sky and says, "Welp. Looks clear to me. Guess we're leaving it like this until tomorrow."
Overnight, it rained. When I arrived the following morning, my coworkers were throwing plastic tarps over the server racks to keep the water from pouring directly into the chassis. Good times.
Admin
Why TF are the backups in the same location as the primaries?
Admin
Maybe they don't have a choice!
Or at most places that I've worked, the rule was:
We have to keep these online 24x7x365 and you get a 2 hour window Midnight to 2AM on Sunday morning to actually try to do the work you suggest.
The rest of the time can easily be filled with firefighting operations caused by the very same lack that you bemoan. That's from 15 companies over the past 20 years (the life of a contract employee). Everything from Dot Bombs and Telecoms to National Laboratories. It's been nearly the same at every place I've been. Talk about it, but no time nor resources can be allocated due to the amount of work (firefighting) the staff is currently doing! Oh and by the way, since you're a money sink, we're firing 2/3 of the current staff to save money (just happened 4 months ago).
To Quote Monty Python "You were lucky to have a lake! There were a hundred and fifty of us living in t' shoebox in t' middle o' road."
That's become one of my rules of life in Sys Admin IT. Damn good thing I still love to tinker with this stuff and dream of a type of situation you found yourself in!!
Admin
Typical - the suits don't realize they're administration like in grade school, while the technology experts, aka *geeks*, know better than to allow a maintenance man into a server room without a plan of what he's going to be *doing*, because (# of employees * pay per hour) * hours server room is down adds up rather quickly.
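The back-of-the-envelope formula above is easy to put numbers on. A minimal sketch follows; the headcount, hourly rate, and outage length are made-up illustrative values, not figures from the story, and the function name `downtime_cost` is invented here.

```python
def downtime_cost(num_employees, pay_per_hour, hours_down):
    """Cost of idle staff: (# of employees * pay per hour) * hours down."""
    return (num_employees * pay_per_hour) * hours_down


# Illustrative numbers only; none come from the thread.
print(downtime_cost(200, 30.0, 5))  # prints 30000.0
```

This counts only idle payroll; lost transactions or penalties would come on top of it.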
< CAPTCHA = OOPS did we do that? />
Admin
HAHA, I usually don't like the story WTFs, but this one is great. So many breakdowns I can't begin to imagine the fingerpointing later that day -- I'm imagining splints for the managers' overworked index fingers.
Yes, the most basic question here is why would 2 chuckleheads be allowed to sit in a server room by themselves? This bank is very lucky that the two guys were merely clueless and did relatively little damage (maybe a few hundred thousand dollars in repair/lost profit?) What if these guys had been intelligent, malicious thieves or spies? They would have had direct, physical access to the bank's most crucial information!
Awesome.
Admin
It's always a fun game to see how many critical systems you can daisychain into one outlet before you reach the limit. Kind of like Jenga, only with the risk of a serious fire.
Admin
Of course, if they really knew what they were doing, it's hard to imagine them worrying about something like "# of employees * pay per hour", in relation to a server being available.
Admin
At the time we (the guys who had the snowball fight) were just young, lowly peon-developers, stuck in an unusually freezing-cold server room attempting to code with numb fingers. The building itself was a very old chemical foundry (didn't know that when I accepted the job). In order to get rid of the chemicals that spilled onto the floor, the concrete was poured with slits (like sidewalk spacer pads) to let the chemicals seep into the ground. Our server room was on a raised floor above this (don't get me started). Any water simply dripped through the raised floor, and then seeped-away.
As a side effect of working there, every now and then, someone would get sick and the DEP guys in hazmat suits would come around looking for toxic chemicals in the air. Many people broke the window locks and had the facilities guys weld the windows in a position such that they were open just a crack to let in 'fresh' air, but couldn't be opened. The security folks didn't like it, until someone found out that they had the same thing done in THEIR office. Needless to say, I didn't stay long after seeing that the first time.
Admin
My reply would have been: "To save time, this message is for the both of you: YOU'RE FIRED!".
I would have waited until they got the A/C working again... maybe.
Dr. Kluge
Admin
Fricking HVAC. I've got all kinds of problems with them, from unplugging things because "they need to plug their tools in" to leaving an air conditioner that was gushing water from a broken line with a piece of wire holding the broken part up for MORE THAN A YEAR, until the wire corroded and the AC started dumping water into the subfloor.
Frankly, I don't think you should be able to get certified on corporate and critical server room AC at the same time. Those bastards cause nothing but problems, and you have to watch them ALL THE TIME.
Admin
Amen, brother. Preach.
The system administrator's job is to keep the systems running smoothly. Therefore, every time you request a wider maintenance window in which to work, you're demonstrating that you can't keep the machines running smoothly. Proactive sysadmins are penalized, and the ones who keep things barely scraping along are rewarded.
Admin
Also, in the US, banks are penalized the amount of interest again that they would have paid for the money being in their accounts for every second that they don't journal the transaction -- i.e., they have to pay the interest ANYWAY, and then pay a penalty equal to that amount of interest.
Most banks have a staff of people who look for banks that have interfaces that are down, to take advantage of this.
Admin
This is one reason why we have two data centers, geographically separated, with different co-location companies. It's extremely unlikely that both of them will decide to do something stupid at the same time.
Admin
Reminds me of the day that I and my former colleagues (6 developers in a room) were all working on our application when that super-smart cleaning guy came in and wanted to borrow our power socket for his vacuum cleaner. Actually, it wasn't that bad - he asked for permission about 20 milliseconds before pulling the plug. Just a little too late for us to collectively scream NOOOOOOOOOOOOOOOOOOOOOOOOO!, of course.
(Someone once observed that every economy needs people who aren't overqualified for simple jobs.)
Admin
Quoting an Anon:
My building had a diesel generator as well. Of course, we shortly discovered that PM had not been carried out on the generator; its cooling system was shot, so it kept overheating and shutting itself down.
Yet another fun example of "there is no such thing as sufficient redundancy."
Of course, my personal corollary to that is "there is no such thing as information so critical that, if it were lost, society would end." Of course, the fact that people *act* like there is information that important, is its own separate and glorious W T F.
Admin
(whoops, wrong thread) O_o
Admin
Everything's stolen and here is the evidence!
I'm gonna sue Open Source into oblivion.
Sincerely,
Darth McBride
Admin
What about "how to breathe"?
Admin
No, it is actually worse than that. The cage is now bolted into place with probably about six 4-inch wood screws (phillips head). What is the cage covering? It is covering an Emergency Power Off switch. So now, when something seems to be about to catch on fire, you will have to find a phillips head screwdriver, and then remove the six 4-inch wood screws just to get to the switch.
CAPTCHA: tango (I wish I knew how to tango)
Admin
Guess that's why I'm never in the reward line. Too busy trying to make my life easier to play the chicken little game trumpeting how I "saved the day" yet again....esp when it never should have happened in the 1st place!
;-)
But since I usually am 1st in line to step into the tar babies that no one else wants (64bit Solaris Sparc custom compiled Samba servers with full Windows AD support anyone??) I get to get into some cool areas where the lesser admins fear to tread.
Admin
I've had a similar event in a hospital (far far away) where I worked. My company had sold an 11/70 with dual RP06 drives. A DEC engineer was installing two RP06 disk drives and decided to show off. "Look I can drop the power and the safety circuit will retract the heads and power down the units before any damage can be" click, SCREEEEEEIEIEIEIEIEIEIEIEIEEEECH. All 18 heads on both drives were wiped out. He had forgotten to hook up the circuit. DEC could not get the heads aligned well enough to read either of the backup disks we sent (two real disks were loaded into the drives, natch). The good news is that I got a $100K/year job out of it (in 1982 when that was serious money) after going over to fix some of the damage.
Admin
It's currently in production for a W2k3 network. (Un)Fortunately I don't have to deal directly w/ the ACLs here since it is the Windows front end to another product that handles the file accesses. It's the CIFS file server pipe for the application behind it.