Admin
Reading this article was like watching a juggernaut coming thundering down a single-lane highway towards a motorbike coming in the opposite direction. You know what's going to happen, you can't look away, and all you're interested in is knowing whether the bike rider is going to walk away.
Phew. He did, although the juggernaut was completely written off.
Admin
39 UNICORNS & RAINBOWS CREATED
:wtf:?
Admin
At least the picture of the melted computer was only symbolic, not an actual illustration of what happened.
Admin
Once again, it has almost nothing to do with IT, and is the long winded rant of a techie versus management. Hey, techie, yeah, you... you are a techie precisely because politics and management are for fools who can't do a thing with their ten fingers. Don't be surprised or annoyed, nor angry, nor ranting, nor wasting perfectly good thedailywtf space with that same old rant. It never changes, and differing details don't make it even one little bit more interesting.
Admin
Well what do you expect from a Remy article?? (the usual html comments are there as well).
As a side note: yes, always test restores in the production environment, but it is usually a good idea to use an unused, non-open file for the test. ;-) (The way I do it is to rename one of the files from a few days ago, restore the original, diff it against the renamed copy, and then delete the renamed copy.)
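The rename, restore, diff routine described above could be sketched roughly like this in Python. This is only an illustration: `check_restore` and the `restore` callback are hypothetical names, and the callback stands in for whatever backup tool actually puts the file back. Unlike the manual routine, this sketch puts the original back if the restored copy does not match.

```python
import filecmp
from pathlib import Path
from typing import Callable


def check_restore(target: Path, restore: Callable[[Path], None]) -> bool:
    """Move `target` aside, restore it from backup, and compare the two copies.

    `restore` is a hypothetical hook: it must recreate `target` from backup.
    """
    aside = target.with_name(target.name + ".pretest")
    target.rename(aside)  # keep the known-good copy out of the way
    try:
        restore(target)
        # shallow=False forces a byte-for-byte comparison, not just a stat check
        ok = target.exists() and filecmp.cmp(aside, target, shallow=False)
    except Exception:
        ok = False
    if ok:
        aside.unlink()  # restore matched, so the renamed copy can go
    else:
        if target.exists():
            target.unlink()
        aside.rename(target)  # restore failed, so put the original back
    return ok
```

This only works as a test if the file has not changed since the backup was taken, which is why an older, closed file is the right candidate.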
Admin
The corporate politics in this one make me angry. Not at Remy, but at the fictional characters Remy created. That's the sign of a good writer.
Admin
"Yeah, I’m gonna need a plan to cycle through every production server. Wipe it, restore from backup tape, and confirm it’s working, yeah?"
I'm not really understanding why that's considered a terrible idea. Yes, a "real" disaster recovery scenario would hopefully involve backup servers (preferably in a second physical location), but pulling the servers out of the farm one at a time to test them is perfectly reasonable. Non-redundant servers can be restored to a different machine to minimize downtime (or plan around the downtime in the case of specialized hardware, but who does that anymore?).
There are some issues that can only be found when testing them "for real". Sure, Brendan's execution was terrible, but Stewart should have had no problem making the test successful.
Admin
I'm not calling shenanigans because an amazingly large number of people don't know how to deal with that kind of crap, but for the record it's really not hard. When an idiot-boss says 'let's do this stupid thing', you reply 'very clever, you were testing us to see if we are aware of what a terrible idea that would be because x, y, z industry standards/it's illegal/we don't have the budget for a server farm in orbit, what a clever manager you are, good boy, here's your chew toy'.
Admin
And when the PHB replies "no you sack of moron who only spends money, I mean do it. Like YESTERDAY!!!!!!!!" Then what do you do?
Admin
You must be new here.
Admin
Put the order in writing and ask him to sign it?
Admin
That's not how backups work.
Admin
Two days of downtime? This is how it ends when people do the same work for too long and close their minds.
This is not how to do it today. See Netflix and their "Chaos Monkey" randomly killing production servers ( http://blog.codinghorror.com/working-with-the-chaos-monkey/ ). The customer should see nothing. That is GOOD DESIGN (tm) for the 21st century!
Admin
BANDWIDTH LEECH, GO SIT IN A CORNER
Admin
Testing production is not such a hot idea, but testing to a recovery site is. The Unimportant Clients.
Admin
Testing production absolutely is a hot idea. Production is the one environment where restore absolutely must work, each time, every time, within the specified time. Testing is the only way to achieve this. My usual approach is to snapshot the VM in question (or make an image if running native) and then, in a scheduled and planned outage, restore from backup. If the restore fails, revert to the snapshot (or image), fire it up, and prod is back. Then and only then investigate why the restore failed. Rinse and repeat until success. Confirm correct operation every 3 or 6 months, and/or after any significant system change.
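For what it's worth, the snapshot-first procedure above boils down to something like the following Python sketch. All four hooks (`take_snapshot`, `restore_from_backup`, `verify`, `revert_to_snapshot`) are hypothetical placeholders for whatever the hypervisor and backup tooling actually provide; nothing here is a real vendor API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreDrill:
    """One planned restore test with a snapshot as the safety net.

    Every field is a hypothetical hook supplied by the caller.
    """
    take_snapshot: Callable[[], str]           # returns a snapshot identifier
    restore_from_backup: Callable[[], None]    # overwrites the system from backup
    verify: Callable[[], bool]                 # True if the restored system works
    revert_to_snapshot: Callable[[str], None]  # rolls back to the given snapshot

    def run(self) -> bool:
        snapshot_id = self.take_snapshot()  # safety net first, before touching anything
        try:
            self.restore_from_backup()
            ok = self.verify()
        except Exception:
            ok = False
        if not ok:
            # Get prod back up first; investigate the failure afterwards.
            self.revert_to_snapshot(snapshot_id)
        return ok
```

The ordering is the point: the snapshot is taken before anything is wiped, and the revert happens before any investigation.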
Admin
So just to be clear: The Real WTF is that any emergency takes at least 2 days to recover from, since they need to physically ship the backup tapes to where they can be used, which is presumably the same place they were created in the first place. And nobody has a problem with this?
Admin
Since you are new here, another tip: When Remy posts, make sure you read the page source.
Admin
So, no one does full recoveries to a VMware host on a sandboxed network? Super easy, just need a box with some cheap cheap 4TB SATA drives, dump your production environment to it, and done! (Physical servers? PPpht)
Admin
TRWTF is that Brendan referred to "everyone out here in Melbourne". Anyone who is from Sydney would say, "everyone down here in Melbourne".
Admin
I was wondering why people seemed so keen to restore to production servers with associated downtime. Restoring to a copy/clone environment isn't that hard, and no downtime means a lot less hassle.
Admin
You don't specifically have to restore to production, but you DO have to restore to wherever you plan to recover to, on-site or off-site. Otherwise you're only "pretty sure" it's going to work for real. Practicing on a Sunday a few times a year plus having a real-life fire drill once every year or two isn't a terrible idea.
What I don't get about this story is that they have two large offices, presumably with two separate IT systems... but they appear to have no capability to simply switch one off and fail over to the other, or each work on their own for a while. Who bets their business on Telstra's reliability? One office being nuked from orbit should be a survivable scenario when you're at the stage where you have remote IT.
Admin
I caught that, too; but if the culture at the Sydney office is to treat the Melbourne office as a place of exile - a place to send people who are out of favour...
Admin
Surely the only WTF is in 'wiping' the machines and not being able to revert quickly if the restore goes wrong.
If it's a virtual machine then rename the file instead of deleting it. If it's a physical machine then pull the hard disks out and put some blank ones in there for duration of the test.
If Stewie had been doing his job then he'd have insisted on doing this and nothing terribly bad would have happened. He's mostly responsible for the disaster, not the boss.
Admin
Restoring to a clone environment means you can only be 99% sure.
Restoring to production is the only way to get that extra percentage point.
(nb. numbers were invented...)
Admin
The cost of downtime can easily outweigh the feel-good factor of that 1% you refer to. And if you've got integration with other systems, rolling back could turn out to be rather harder than you might like.
I look after testing the restores of moderately complicated systems (bigger than a file server anyway...), and over the years have persuaded people that yes, we can test them well enough without downtime using an appropriately designed environment. It took a bit of persuasion at the beginning, but it's what we're doing now.
At some point the management dream of doing one big weekend when we test the lot with downtime might happen - but it won't achieve much beyond a lot of meetings, overtime, hair tearing out, and a warm fuzzy glow in the hearts of some people that that tickbox has been ticked.
Admin
Start looking for a new job when the walking haircut that calls himself a manager makes you do it?
Admin
Laugh louder, and add 'nice deadpan - but I know you're joking because only a complete idiot would suggest doing that'.
By the by, but don't ever let a work colleague talk to you like that. If they can't manage the basic politeness due to a human being, you ask them to leave the room and contact HR (or a line manager, if appropriate) about the workplace bullying you're experiencing. It doesn't matter whether it's some snotty-nosed kid or the chief technology idiot, you don't ever let anyone get more than a few words into anything phrased like that. For what it's worth I've had to complain up the chain about such things maybe half a dozen times, and while not once have I had any trouble, three or four times it's resulted in my promotion - it seems to me that employers assume anyone standing up for themselves must be really good if they're willing to risk sticking their neck out, but YMMV.
Admin
Original submitter here...
Just some more info - This was back in the 90's on physical Unix servers. No VMs, no snapshots, no extra disks to restore to. It was wipe everything and hope it restores back again. Not just one file, not just one server... every file on every server including the OS. Including systems which run the General Ledger for the company. With untested backups.
As the article says, I did raise the issue with local management, who were all in agreement. But due to the political situation (Tony thought they were all clowns and was trying to get them sacked) they wouldn't say a word back to Tony or above. The only way I got out of it was to ask for Brendan's signature on it by sending him the 'plan' and asking him to advise the starting date.
He must have realised that his fingerprints would be on it then, and he assigned the task to someone else. It was a disaster. No, they didn't have to wait for tapes to come on site: there was no way to restore from backup, so they had to rebuild the entire server from scratch. But Tony recovered from it by convincing the business that Melbourne were a bunch of clowns that couldn't do anything right and he was there to fix it.
Admin
Who needs to test backups? Just wait till everything falls apart and look for a new job.
lol
Admin
Makes me wonder how do they test their fire evacuation plans. I guess they actually light the building on fire, since a mere drill isn't enough to know if the plan really works. (Ha, captcha to resolve needed me to find pictures with fire hydrants...)