Admin
"While most computer systems have failure rates on the order of a few days,"
By which you mean the Personal Computer (IBM-PC). In contrast to the minicomputers of the day which ran for years.
Admin
"It's nothing to worry about," Chris's boss told him. "Those things hardly ever have issues. And in the highly unlikely scenario that you'll personally have to deal with one, they've got incredible support to walk you through anything."
Indeed. It's amazing that someone who was told it would be an "unlikely scenario" by the Guru's backup wasn't more prepared.
/Comprehension != needing every detail dictated
Admin
Ummm. In fact when I submitted the story to Alex I did in fact say that I was hired to write software for these guys. So TRWTF is that Alex's re-writing of the software introduced a few errors :P
'Chris B'
Admin
Story. I meant 're-writing of the story' not 'software'
Sigh. It's been a long day.
Admin
The problem with firing this guy is that he is not the root cause of the problem. Always ask yourself 'Why?' five times. Why did the server go down? Cause some guy flipped the wrong switch. Why is this a problem? Because the server didn't come back up in three minutes. Why didn't it come back up in three minutes or so? Because someone else wasn't doing their job. Why weren't they doing their job? (Here's what I think might be the root cause.) Because there wasn't a well established Process to prevent this kind of thing.
So firing this guy doesn't really solve anything, you've found a scapegoat, while not touching on the fundamental problem. Chris' error is just a symptom of a deeper root problem.
Admin
You must be new here. You have to fire the guy who made that mistake because, obviously, only a completely stupid idiot unworthy of breathing our oxygen could ever do something so impossibly stupid. Also, mercy = weakness.
None of us here have ever made such a mistake, ever, which is why we all still have (and deserve) our jobs. We'd appreciate it if idiots like you would stop cluttering up our forum.*
*Just in case: I'm not being serious.
Admin
Oh, I've seen equally crazy stuff, writ somewhat smaller. Old school unix machines took so little maintenance that they often got relegated to old wiring closets, then quietly forgotten. Especially an infrastructure machine; your secondary DNS machines, your NAT load balancer, your DMZ firewall...All functions well suited to being dumped on a headless machine in the phone room, where they will inevitably gather dust and be forgotten.
This is one of those lessons people need to learn: even if a machine never goes down, you need to bring it down occasionally, in a controlled situation, to make SURE it will come back up in the event of a real failure. Sure, getting 1000+ days back when you type "uptime" is an e-peen extension, but having a machine suffer a catastrophic meltdown because it actually needed to be rebooted later...That's just sloppy.
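For what it's worth, the "bring it down occasionally" rule is easy to nag yourself about. A rough sketch in Python, assuming a Linux box (it reads `/proc/uptime`) and a made-up 90-day limit; the function names and the threshold are illustrative only, not anything from the story:

```python
from datetime import timedelta

# Hypothetical policy: flag any machine that has gone this long without a
# controlled restart. The 90-day figure is an assumption for illustration.
DEFAULT_MAX_UPTIME = timedelta(days=90)


def reboot_drill_overdue(uptime: timedelta,
                         max_uptime: timedelta = DEFAULT_MAX_UPTIME) -> bool:
    """Return True when the machine has been up longer than the policy allows."""
    return uptime > max_uptime


def read_linux_uptime(path: str = "/proc/uptime") -> timedelta:
    """Read the uptime on Linux; the first field of the file is seconds since boot."""
    with open(path) as f:
        seconds = float(f.read().split()[0])
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    up = read_linux_uptime()
    if reboot_drill_overdue(up):
        print(f"Up for {up.days} days: schedule a controlled restart.")
    else:
        print(f"Up for {up.days} days: within policy.")
```

Run from cron, something like this would turn a 1000-day uptime into a to-do item rather than a trophy.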
Admin
Being a software developer with little chance of ever having to handle a server problem, it's expected that you wouldn't already know the process from memory. But, to my mind, that's all the more reason for going slowly, double-checking everything before acting.
But, as someone has said earlier in this thread, once bitten, twice shy. Lesson well learned, eh? Good!
Admin
What's really sad is that your asterisk is needed
Admin
It seems to me that organizations that have a philosophy of "one mistake and you're fired" are likely to quickly end up with employees who exercise absolutely no initiative. In a case like this of a backup to a backup, the rational response would be to say, "Hey, that's outside my area of expertise, we'd better wait until the regular guy gets back."
Reminds me of a poster I saw years ago of the Real System Development Life cycle:
Admin
Well I don't know for sure, but I'd guess that an "issue" would be because they used a proprietary OS. If you advertise for an XYZ OS guru vs a Unix guru, you're going to get a lot more qualified people for the Unix.
Admin
Some people are on a mission to criticize any word that they simply disagree with. Specifically words that are newer to the lexicon. "Mission" can mean a NASA mission or a big secret government project, and that's it. So it's my mission to find any usage of "mission" or "mission-critical" and point out how wrong it is because obviously you aren't doing anything important enough to qualify as a mission. Just like your production database isn't actually producing anything, so it should really be called your "Final Draft Database".
Admin
Someone seriously needs to learn how to properly use "-".
Admin
Anally-retentive.
Admin
If you taped two tandem machines together...
Admin
Good job, dude. :)
Don't worry, though.. Boneheaded moves happen to the best of us. I once had to slap a hand away from a reset button at the last second that would have downed an entire hospital, and everything connected/dependent upon it.. (ER doctors not being able to enter/look up patient information = kinda bad. Oh, and when billing goes down? Yeah, the hospital stops making money.) Quite a whoops. And that hand belonged to my team lead, of all people. 20+ years with the company.
Murphy and his Law are both alive and well.
Admin
...actually, you can "tape" several tandem machines together!
Admin
This is why the rule at several places I've been (I'm an OS developer) was: if you change anything that affects a system's booting or default after-boot state, reboot it at the first opportunity to make sure it boots the way you expect. Hot patches etc. are great, but letting them pile up is a very bad mistake.
Admin
Another is that when Unix became popular in business in the early 1990's, any large system that didn't run Unix (or wasn't capable of running Unix) lost a lot of market share. Tandem never did really recover even though they were an excellent system for the people who needed very high reliability.
[The only large non-Unix systems that seemed to survive the 90's intact were IBM mainframes (which could run Unix as a guest OS) and IBM's AS400 (which usually seem to be installed in companies that don't have large IS departments). Even DEC's VMS didn't survive the 90's in a healthy state]
Admin
The Dell PowerEdge servers have a small LCD display that tells you exactly what part is broken if there is a hardware problem. It's pretty sweet. A light on the actual piece would be cool, but the extra wiring required for that would be a bit of a pain.
Admin
Mission accomplished.
Admin
Mind you, software on the Tandem had to be written to take advantage of the NonStop hardware and OS. The original post is an example of an improperly configured system. But when you followed Tandem's guidelines, it was a dream of a machine.
Admin
The actual "list" is about projects in general, not the computer field:
Admin
Do you go on Digg and complain about the lack of articles about shovels, too?
Admin
Well, that goes to show why there are always two or more computers in high-availability setups, and why you should always restart your computers after big changes.
Getting just a few or several years of uptime from some computer shouldn't make any difference.
Admin
If anyone should have been fired it should not be Chris, but Chris' boss, or that guy's boss, for letting a newbie go on such an important call by himself.
Admin
Redundant power supplies still help if one blows. They don't help when the power goes out, obviously. Would your manager have been cool with buying separate PDUs and wiring the server room with isolated electrical circuits? Or was he the "you've already got one of those, make do" type?
I wouldn't call what he did a major blunder.
Admin
In 1995 I had a job interview at a news and financial information services company that used Tandem computers. The guy interviewing me was so proud of the equipment's ability to withstand hardware failure. He reached over and flipped a switch on the server and their system went dead. The lights on several racks of modems stopped blinking. A major fubar. I did not get that job. A similar failure event happened when someone was showing me how he could hot swap drives on a RAID array. Apparently the array was in the middle of rebuilding itself.
Last place I worked they had procedures in place to take down the mainframe every 60 days. The purpose was to test their restart procedures and prevent problems like the one in the story. The mainframe was critical to their business mission.
Admin
Other than verifying I can restore from backups every now and again, I would NEVER tempt the networking gods in such a manner. A scheduled test is one thing; trying to show off is just dumb.
Admin
Unless your company starts writing a telecom-grade application that can handle even more than the Tandem and in the end realizes that, with its own OS for multiple CPUs, it gets the job done on COTS hardware. No more Tandem needed :)
Admin
I used to joke to my boss that when I went into the plant, I would just hang around the console and if anyone went near it, I would scream "ARE YOU F#$&ING CRAZY?!!!"
Admin
If you are not familiar with the term "Mission Critical" as it applies to software applications, that sir is the real WTF.
Admin
Fricking awesome. I had fun with a former DIT once. We had a db get corrupted so we needed to restore from a backup (tape backup was all we had at the time). So, I got a broken backup tape and put it back together. So I go into his workstation and as I say "whew...I found the tape" I purposefully fumbled it and it hit his desk and broke into pieces. Priceless.
Admin
Well, you would if you'd never tried it and didn't bother putting the slightest effort into thinking before you comment.
I'd imagine that XYZ OS gurus are thin on the ground, but if you advertise for a Non-Stop guru, or a VOS guru, or a TPF guru, the chances are that you'll get a sizeable collection of high-quality applicants. (How this is meant to help you when some wet-behind-the-ears guy comes in and pulls boards without thinking is unclear. In my old, VOS, environment, it was at least a salesman. This was good, because we could extort large chunks of his expense account in drunken revenge.)
Of course, it'll cost you. OS Gurus are different from Enterprise Architects.
If you advertise for a Unix OS guru, you're in the klartz. First of all, you have to sieve through thousands of flavours of *nix. Then you have to sift through thousands of possible combinations of requirements on your own flavour of *nix.
Then you have to face up to the fact that 99% of people who claim to be Unix OS gurus are in fact bare-faced lying morons. And it'll still cost you.
And you won't even get a fault-tolerant system, because you really can't build one of those with standard Unix -- otherwise, the market being the market, somebody would have done so. Double panics all round!
Why people persist in thinking that Unix is anything other than a clapped-out old 1970s OS in dire need of a bullet through the head (both feet already having been self-sacrificed) is beyond me. I use it, but I don't have to admire it.
Admin
Frankly, if the mini stays up for years and then goes down in the middle of the Melbourne Cup and stays down for several hours, I'm not impressed. Five nines translates to five minutes a year -- guaranteed under non-idiot circumstances.
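The "five nines is five minutes a year" figure is simple arithmetic. Here's a quick sketch in Python that reproduces it; the function name is made up for illustration, and it assumes a 365-day year:

```python
# Minutes in a (non-leap) year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability, i.e. an
    unavailability of 10**-5 of the year.
    """
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Five nines works out to about 5.26 minutes a year, so a single outage of several hours burns through decades of that budget.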
Admin
Farce, though -- there you may have a point.
By the way -- did anybody comment that 1 is not a prime number yet?
Admin
Prior to software, I once worked for a state criminal justice licensing board. We granted and revoked licenses for police and correctional officers. I had interned at the state agency and then got a part time job after my internship.
Due to a screw-up on my part, I got a guy fired from his new job at a correctional facility. He and his family had moved to my state from up north specifically for this job. Because I misinterpreted the statute and sent a letter rejecting his permanent license (he was hired on a temporary license), the correctional facility let him go.
I didn't get fired - I had a very understanding manager who called the hiring guy at the correctional facility and explained my f***-up. They reinstated the guy as if he'd never left. But I felt like crap - I'd gotten this poor guy fired, probably caused his wife a heart attack. "You moved us down here and then got fired?" When I first found out about my screw-up, I found a nice solitary place and bawled like a baby over the fact I had nearly cost someone their career. I honestly don't know if I could have lived with myself if the guy hadn't been rehired.
My whole point? I screwed up big time and got a second chance. I worked there for several years after that and was a great employee (or so I was told). I'm more conscientious about everything I do because of that experience. One mistake doesn't make you an idiot - it makes you human.
Admin
He didn't say it was better, just that being anything other than Unix meant that the vendor had far fewer potential customers. I don't like Unix much either, but the truth is, by being seen as portable and off-the-shelf to the tech guys, and enterprise-ready by the managers, it took a huge chunk out of any vendor that didn't join the fun. The fact that it is neither of those things was, and still is, irrelevant.
When the last Unix box is shut down, it won't be because everyone finally acknowledges what an outdated PITA it is. It will be because Linux is a sparkly new PITA that has all of the same "advantages".
And he's right. There's a huge number of Unix SA's out there. Whether they can walk the walk, especially when it comes to the strange details of a particular variant they haven't used before, is not known or seen by the manager who's deciding what type of system to go with.