Admin
When all logical steps fail, reboot.
If that fails, wipe & reload.
And if THAT fails...
Admin
Should there not be a label somewhere on the box that says "MISSION CRITICAL SERVER" or something? Shouldn't the test server be labeled as such? The real WTF is that these people don't know what labels are.
Admin
Would have been even funnier if the hardware guy had formatted the drive, thinking it was just a test server, while trying to reconfigure the whole thing. Then a whole day of trades would have been lost because somebody didn't label the server and the "critical" server got wiped out. And if they didn't have a backup CD on hand....
Admin
That's a pretty epic WTF.
Admin
And we all learn a valuable lesson: Put clear labels on all servers, and possibly even a map to each server in a cabinet on the door of the cabinet.
Admin
Hilarious. And far, far, far too commonplace.
Admin
This is a mission-critical server, and they don't have a hot standby? If the server fails for ANY reason, it should automatically fail over to the standby. (I think we all know what the standby needs - separate power source, separate network to a separate backbone, along with a third hot standby located halfway across the country...)
Relying on any single piece of equipment to operate your business is foolhardy at best. I hope this "upper management" never has to explain to the board of directors that the computer died and so we lost a whole day's revenue.
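Something as simple as a watchdog loop gets the idea across. A rough sketch only - the hostnames, port, and promote_standby() hook below are all made up, and a real shop would use proper cluster tooling rather than a hand-rolled loop:

    # Watchdog sketch: probe the primary, and if it stops answering
    # a few times in a row, flip traffic to the hot standby.
    import socket
    import subprocess
    import time

    PRIMARY = "trade-primary.example.com"   # hypothetical production host
    STANDBY = "trade-standby.example.com"   # hypothetical hot standby
    PORT = 443                              # service port to probe
    FAILURES_BEFORE_FAILOVER = 3

    def is_alive(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def promote_standby():
        """Placeholder for whatever actually moves traffic over
        (VIP move, DNS update, load-balancer change, etc.)."""
        subprocess.run(["logger", "failing over to " + STANDBY], check=False)

    failures = 0
    while True:
        failures = 0 if is_alive(PRIMARY, PORT) else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            promote_standby()
            break
        time.sleep(30)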
Admin
The real fun part is that Michael covered for the hardware engineer. I hope the hardware engineer purchased him a beer or two for that one.
Admin
This is an investment in leverage. The ol' "YOU ARE MY B*TCH" scenario. I can only imagine that there's some benefit, or that Michael is just an uber "nice person (tm)".
Admin
Of course, then you end up with the old "separate power sources plugged into the same power strip" problem. I've seen it. It made me laugh.
Admin
Too many real WTFs in this story.
If Michael is so important, what is he doing living an hour away? Or rather, an hour's drive at 2 AM away...?
Admin
I suspect the anonymization ruined this story. Either that or every test/diagnostic was effectively worthless.
I trust they invested in a monitoring solution after this incident.
Note from Alex: Like most, this story wasn't anonymized beyond redacting key details (location, names, etc). The only things that get really "anonymized" are the "big systems" and code samples -- they're too specific and far too recognizable to a lot of people, so I have no choice but to change them. This story is, sadly, universal!
Admin
In general, by the time you've graduated to being the 'last resort' at 2AM for a large company, you've also graduated from Caffeine During the day to massive amounts of Alcohol at night.
Admin
You mean those aren't the first steps? :-)
Admin
There are some days I wouldn't be able to handle massive amounts of alcohol at night if I didn't have the massive amounts of caffeine during the day...
Admin
Two words: Orange Cable
CAPTCHA: onomatopoeia (WTF?!)
Admin
I'd say Michael got some pretty good credit out of the situation. Even though he said "we just rebooted it", Management will still know that when he came in, the problem magically fixed itself.
They think he's got some magic touch. Or maybe there was some complex fix but he knew he was so smart that they'd never understand, so he didn't even try. Look at that modest guy, saving the company and acting like it's nothing. Let's give him a raise. ...Not like that punk hardware engineer who just follows people around and nods looking guilty.
On second thought.... this probably just means he'll get more calls at 2AM.
Admin
Who'd have thought it possible for an entire physical server to be 410...
Admin
Here's how the story would have gone if it was me...
2 am: rrriiinnngggggg
me: <unplugs phone>
Seriously, what kind of tool goes into the office at 2am? F- that.
Admin
Label?!?!? If it's mission critical to your business and in control of your customers' money, it should be locked behind several doors that require everyone's authority to open (or at least to power down).
And why is there only one server for the "mission critical" function? And I'd assume it's on something <facetious>stable like Windows NT</facetious>.
Bah!
Admin
The kind that's paid several hundred dollars an hour to be available at 2am.
Admin
Best thing about the story is that he covered for an engineer who made an honest mistake. That is the type of guy I think we would all like to work for. Of course, the second time that happens, I would hang the engineer out to dry.
Admin
True. My buddy gets paid extra to be the one on call 24/7. If he can't fix it, then I get called.
Admin
Sadly too realistic... but so true. Now if only I were paid that much $$ when I had to do it...
I had that setup for a prior contract and until I came on, most of the servers were unlabeled like that. Glorious stuff.
I recall having an "incident" once that was vaguely similar - I was installing a (not-so-critical) data collection workstation in a power plant's control room, and somehow a cable connected to a critical data collection workstation right below became loosened. Lord knows why it wasn't actually screwed on, but the operators lost access to part of the control system. Boy did I get reamed over it. Luckily for me, the guy who did the implementation happened to be on-site that day and knew exactly what the cause would be.
Yeah and he had the maintenance team actually screw down the cable (and check all other workstations in the control rooms) afterwards too :)
Admin
Bud Bundy is that you?
Admin
Of course, if there wasn't a backup (server) already in place, I made it my business to inform the upper brass, at 3 AM, whether they were at the office or in bed, that they had better spring for $$$ or they would be getting more 3 AM "alerts" - and it invariably worked, too!
Admin
In fact, there was a label, but it simply said "Orange cable."
Admin
The first clue that my friend in Chicago got that his apartment had been broken into was a "host not found" error connecting to his home computer, which eventually was revealed to mean the host had been stolen. Yes, he had backups.
Admin
I've been there, but usually not so bad. I remember one time working around an AS/400, trying to figure out the communications lines. I finally worked out which lines went to which offices, saw an "extra" line up, thought "hmm, must just be an old configuration," and disabled it.
Five minutes later I get a phone call: "This is the Atlanta office. Our connection just went down." "Oh, okay, let me check it." Reactivated it. "Thanks, it's up now. What was wrong?" "Not sure, just power-cycled it." Hung up. Then walked out of my office: "Why didn't anyone tell me we had an office in Atlanta?"
Admin
Classic.
Admin
A monitoring system like Nagios that does heartbeats on all the servers would've saved the day here.
"Oh, look, the production server isn't responding to anything."
Admin
The story would have been a lot more boring if it had gone that way.
Admin
They probably should have used little orange labels that stated 'not an orange cable but a production server'.
Admin
"And attempted to resynching the connection."
Admin
The company I used to work for required two separate power sources for each server. Even some servers we ordered with only one power supply, because they were not mission critical and utterly expendable, were retrofitted with a second power supply just to be on the safe side.
One day, I get paged because all servers were gone from the net. All of them. I drive to the datacenter...strange. All of the servers are reporting a power supply failure, but are still operating.
Turned out that the switches did not have a second power supply because "you can always have them plug the servers into the other switch if one fails". Too bad that in the meantime all of the ports on "the other switch" were already in use, and management had never approved the budget for the remote hands service provided by the datacenter personnel, so one of us had to drive 100 kilometers to fix the problem.
Admin
Does anyone else think it's weird that they didn't have a failover system?
Where I used to work we had overkill failover: multiple failover clusters and a disaster recovery cluster at a remote location in case the building was bombed. Then again, it was a fund management company.
Admin
Agreed, the missing label is clearly the WTF here. My former employer had a pretty nasty server room with all kinds of boxes everywhere, but they were all labeled. Even the most temporary test box.
Admin
Michael just saved the butt of his Hardware Engineer.
Admin
We are, most likely, talking about New York here. The office would most likely be in the financial district in the south of Manhattan. You just don't find housing less than an hour from there. It exists, but is either already occupied, or costs seven plus digits.
Admin
We label all of our servers with a project name, project number, and hostname. We don't need labels that say something like "PRODUCTION!" or "This server is SUPER DUPER important". There's an assumption that before you disconnect a server you better know what it's doing and who is using it OR you better find out. Of course, our hostnames indicate whether it is a test, dev, integration, model office, or production server anyways.
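Something like this toy lookup illustrates the idea (the naming scheme below is invented for illustration, not our actual convention):

    # Environment-in-the-hostname convention: the second field of the
    # hostname says what kind of box you're about to unplug.
    ENV_CODES = {
        "p": "production",
        "t": "test",
        "d": "dev",
        "i": "integration",
        "m": "model office",
    }

    def environment(hostname):
        """Assume names like 'acme-p-web01': project, env code, role+number."""
        parts = hostname.split("-")
        if len(parts) >= 2 and parts[1] in ENV_CODES:
            return ENV_CODES[parts[1]]
        return "unknown -- go find out before you touch it"

    print(environment("acme-p-web01"))  # -> production
    print(environment("acme-t-web01"))  # -> test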
Admin
Indeed!
I ran into a good one a while back with a bunch of co-located boxes, with a completely redundant network stack to ensure nothing could go wrong. The hosting company put all the network gear into the same cabinet. And put absolutely everything behind the same circuit breaker.
Needless to say, after a minor power glitch took out every single server at the same time we (a) shouted very loudly at their technical guys, and (b) rapidly arranged to move the whole goddamn lot to a different facility.
Moral of the story is, different companies have different definitions of the word "professional". Don't entrust mission-critical systems to a third party unless you get the chance to double-check everything yourself.
Admin
That happened at a client I was working at. One of their tech guys formatted the source control box. In itself that wouldn't have been disastrous, other than the fact that they didn't have any real backup procedures in place. Eventually, after about a week without source control, the same tech managed to recover all the data on the box. Then he had the audacity to be upset that he didn't get praise for recovering the data he had destroyed.
Admin
Seems unlikely that the hardware engineer wouldn't have considered that he removed the production server rather than the test server the minute the problem was detected.
Admin
The REAL WTF is that the hardware engineer had removed the machine before he left for the day (as evidenced by the fact that Michael had to call him back IN), but nobody NOTICED until 2AM. On a "mission critical" system. That means if the hardware eng left at 5pm, it was 9 hrs later. Heck, even if he was working until 10pm that night, it was 4 hrs later...which is pretty bad...
Admin
It was nice of the guy to cover for the hardware engineer's screwup. But, the real WTF is an organization that's so punitive that it's necessary to cover for these mistakes. You KNOW that hardware engineer will never make that kind of mistake again (unless he's dumber than a bucket of gravel).
So, fire him for the screwup, and let your competitor reap the benefit of his hard-won experience. Yeah. That's it.
Admin
On the other hand, this is about the stock market; possibly no one uses the server when the exchanges aren't open.
Admin
First-time mistakes get overlooked... Damn I wish I worked for such an organization... The scope of damage the hardware engineer could have done versus saving his job for being a dumbass...
So what if the server was label-less and unsecured? It's not like the server was placed in the middle of a sidewalk where everyone could take a jab at it. He's supposedly an ENGINEER; he should have known better. "Err, I, err, isn't this the test server?" stfu. Give him a labeled server and he still would have moved it. He's lost all credibility. He made a grave mistake; it's not like spilling coffee on your desk.
Ignorance is not an excuse.
"Fvck, who knew that big red unlabeled button was the nuke missile launcher? Not my fault, it was not labeled."