• Quite (unregistered)

    Reading this article was like watching a juggernaut come thundering down a single-lane highway towards a motorbike coming the other way. You know what's going to happen, you can't look away, and all you're interested in is knowing whether the bike rider is going to walk away.

    Phew. He did, although the juggernaut was completely written off.

  • fyl2xp1 (unregistered)



  • justnacl (unregistered)

    At least the picture of the melted computer was only symbolic, not an actual illustration of what happened.

  • pouzzler (unregistered)

    Once again, it has almost nothing to do with IT, and is the long-winded rant of a techie versus management. Hey, techie, yeah, you... you are a techie precisely because politics and management are for fools who can't do a thing with their ten fingers. Don't be surprised, annoyed, angry, or ranting, and don't waste perfectly good thedailywtf space with that same old rant. It never changes, and differing details don't make it even one little bit more interesting.

  • Yazeran (unregistered) in reply to fyl2xp1

    Well, what do you expect from a Remy article?? (The usual HTML comments are there as well.)

    As a side note: yes, always test restores in the production environment, but it is usually a good idea to use an unused/non-open file for the test... ;-) (The way I do it is to rename one of the files from a few days ago, restore the original, diff it against the renamed copy, and then delete the renamed copy.)

  • Nicholas "LB" Braden (github)

    The corporate politics in this one make me angry. Not at Remy, but at the fictional characters Remy created. That's the sign of a good writer.

  • Toby Johnson (google)

    "Yeah, I’m gonna need a plan to cycle through every production server. Wipe it, restore from backup tape, and confirm it’s working, yeah?"

    I'm not really understanding why that's considered a terrible idea. Yes, a "real" disaster recovery scenario would hopefully involve backup servers (preferably in a second physical location), but pulling the servers out of the farm one at a time to test them is perfectly reasonable. Non-redundant servers can be restored to a different machine to minimize downtime (or plan around the downtime in the case of specialized hardware, but who does that anymore?).

    There are some issues that can only be found when testing them "for real". Sure, Brendan's execution was terrible, but Stewart should have had no problem making the test successful.

  • Dave (unregistered)

    I'm not calling shenanigans because an amazingly large number of people don't know how to deal with that kind of crap, but for the record it's really not hard. When an idiot-boss says 'let's do this stupid thing', you reply 'very clever, you were testing us to see if we are aware of what a terrible idea that would be because x, y, z industry standards/it's illegal/we don't have the budget for a server farm in orbit, what a clever manager you are, good boy, here's your chew toy'.

  • Carl Witthoft (google) in reply to Dave

    And when the PHB replies "no you sack of moron who only spends money, I mean do it. Like YESTERDAY!!!!!!!!" Then what do you do?

  • Syntaxerror (unregistered) in reply to fyl2xp1

    You must be new here.

  • (nodebb) in reply to Carl Witthoft

    Put the order in writing and ask him to sign it?

  • LH (unregistered) in reply to Yazeran

    That's not how backups work.

  • Roman (unregistered)

    Two days of downtime? This is what happens when people do the same work for too long and keep their minds closed.

    This is not how to do it today. See Netflix and their "Chaos Monkey", which randomly kills production servers ( http://blog.codinghorror.com/working-with-the-chaos-monkey/ ). The customer should see nothing. That is GOOD DESIGN (tm) for the 21st century!

  • Duke of New York (unregistered)


  • (nodebb)

    Testing production is not such a hot idea, but testing to a recovery site is. The Unimportant Clients.

  • Thrud (unregistered)

    Testing production absolutely is a hot idea. Production is the one environment where restore absolutely must work, each time, every time, within the specified time, and testing is the only way to achieve this. My usual approach is to snapshot the VM in question (or make an image if running native) and then, in a scheduled and planned outage, restore from backup. If the restore fails, revert to the snapshot (or image), fire it up, and prod is back. Then and only then investigate why the restore failed. Rinse and repeat until success. Confirm correct operation every 3 or 6 months, and/or after any significant system change.
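    The snapshot-first drill above can be sketched as a small driver. All four hooks (`snapshot`, `restore`, `healthy`, `revert`) are hypothetical stand-ins for site-specific tooling:

```python
def restore_drill(snapshot, restore, healthy, revert) -> bool:
    """Planned-outage restore test: snapshot first, attempt the restore,
    and if the restored system fails its health check, roll back to the
    snapshot so production comes straight back up."""
    snap = snapshot()          # VM snapshot (or disk image on bare metal)
    try:
        restore()              # wipe and restore the server from backup
        if healthy():          # confirm the restored system actually works
            return True        # restore verified
    except Exception:
        pass                   # a failed restore is what the snapshot is for
    revert(snap)               # roll back: prod is back
    return False               # then, and only then, investigate the failure
```

    Returning False rather than raising keeps the outage window short; the investigation happens after production is back, exactly as the comment describes.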

  • William Crawford (google)

    So just to be clear: The Real WTF is that any emergency takes at least 2 days to recover from, since they need to physically ship the backup tapes to where they can be used. A location which is presumably where the tapes were created in the first place. And nobody has a problem with this?

  • Angela Anuszewski (google) in reply to fyl2xp1

    Since you are new here, another tip: When Remy posts, make sure you read the page source.

  • Anonyguest (unregistered)

    So, no one does full recoveries to a sandboxed network VMWare host? Super easy, just need a box with some cheap cheap 4TB sata drives, dump your production environment to it, and done! (Physical servers? PPpht)

  • Andrew (unregistered)

    TRWTF is that Brendan referred to "everyone out here in Melbourne". Anyone who is from Sydney would say, "everyone down here in Melbourne".

  • anotheranonyman (unregistered) in reply to Anonyguest

    I was wondering why people seemed so keen to restore to production servers with associated downtime. Restoring to a copy/clone environment isn't that hard, and no downtime means a lot less hassle.

  • foxyshadis (unregistered)

    You don't specifically have to restore to production, but you DO have to restore to wherever you plan to recover to, on-site or off-site. Otherwise you're only "pretty sure" it's going to work for real. Practicing on a Sunday a few times a year plus having a real-life fire drill once every year or two isn't a terrible idea.

    What I don't get about this story is that they have two large offices, presumably with two separate IT systems... but they appear to have no capability to simply switch one off and fail over to the other, or each work on their own for a while. Who bets their business on Telstra's reliability? One office being nuked from orbit should be a survivable scenario when you're at the stage where you have remote IT.

  • (nodebb) in reply to Andrew

    I caught that, too; but if the culture at the Sydney office is to treat the Melbourne office as a place of exile - a place to send people who are out of favour...

  • John S (unregistered)

    Surely the only WTF is in 'wiping' the machines and not being able to revert quickly if the restore goes wrong.

    If it's a virtual machine then rename the disk file instead of deleting it. If it's a physical machine then pull the hard disks out and put some blank ones in there for the duration of the test.

    If Stewie had been doing his job then he'd have insisted on doing this and nothing terribly bad would have happened. He's mostly responsible for the disaster, not the boss.

  • John S (unregistered) in reply to anotheranonyman

    Restoring to a clone environment means you can only be 99% sure.

    Restoring to production is the only way to get that extra percentage point.

    (nb. numbers were invented...)

  • anotheranonyman (unregistered) in reply to John S

    The cost of downtime can easily outweigh the feel-good factor of that 1% you refer to. And if you've got integration with other systems, rolling back could turn out to be rather harder than you might like.

    I look after testing the restores of moderately complicated systems (bigger than a file server anyway...), and over the years have persuaded people that yes, we can test them well enough without downtime using an appropriately designed environment. It took a bit of persuasion at the beginning, but it's what we're doing now.

    At some point the management dream of doing one big weekend when we test the lot with downtime might happen - but it won't achieve much beyond a lot of meetings, overtime, hair tearing out, and a warm fuzzy glow in the hearts of some people that that tickbox has been ticked.

  • Captain Obvious (unregistered) in reply to Carl Witthoft

    Start looking for a new job when the walking haircut that calls himself a manager makes you do it?

  • Dave (unregistered) in reply to Carl Witthoft

    1. Laugh louder, and add 'nice deadpan - but I know you're joking because only a complete idiot would suggest doing that'.

    2. By the by, don't ever let a work colleague talk to you like that. If they can't manage the basic politeness due to a human being, you ask them to leave the room and contact HR (or a line manager, if appropriate) about the workplace bullying you're experiencing. It doesn't matter whether it's some snotty-nosed kid or the chief technology idiot: you don't ever let anyone get more than a few words into anything phrased like that. For what it's worth, I've had to complain up the chain about such things maybe half a dozen times, and not once have I had any trouble; three or four times it's resulted in my promotion. It seems to me that employers assume anyone standing up for themselves must be really good if they're willing to risk sticking their neck out, but YMMV.

  • Stewart C (unregistered)

    Original submitter here...

    Just some more info - this was back in the '90s, on physical Unix servers. No VMs, no snapshots, no extra disks to restore to. It was wipe everything and hope it restores back again. Not just one file, not just one server... every file on every server, including the OS. Including the systems which ran the General Ledger for the company. With untested backups.

    As the article says, I did raise the issue with local management, who were all in agreement. But due to the political situation (Tony thought they were all clowns and was trying to get them sacked) they wouldn't say a word back to Tony or above. The only way I got out of it was to ask for Brendan's signature on it, by sending him the 'plan' and asking him to advise the starting date.

    He must have realised that his fingerprints would be on it then, and he assigned the task to someone else. It was a disaster. No, they didn't have to wait for tapes to come on site - there was no way to restore from backup at all, so they had to rebuild the entire server from scratch. But Tony recovered from it by convincing the business that Melbourne were a bunch of clowns that couldn't do anything right and he was there to fix it.

  • Earlchaos (unregistered)

    Who needs to test backups? Just wait till everything falls apart and look for a new job.


Leave a comment on “Tested Backups”
