Admin
"It was all Leslie could do to keep herself from firing Heather on the spot."
... I'm not sure you should resist there, someone who does that without noticing the problem doesn't belong near critical systems or backups.
Admin
So they are using ZFS to store a (one!) huge file. This does not sound like a very challenging task. Also nothing wrong on the ZFS, why is the consultant labeled as "unhelpful". It turns out he was right, the file was deleted, ZFS functioned as intended.
Admin
TRWTFS:
a) The consultant was absolutely right and spot on, and reported accurately what happened
b) Hiring experts who know their stuff and then not listening to them
c) Doing your backups with a bunch of random bat scripts
d) You're holding it (ZFS) completely wrong. Look into snapshots and be amazed at how awesome backups can be and how much of a backup history you can keep for very little cost
e) They seem to have an unhealthy bias against unix on servers (which is weird and wrong. installing windows on a production server is like animal cruelty)
Admin
I'm not sure what the bigger WTF is: not testing the back-up procedure regularly or not reviewing scripts before putting them on prod. I'm guessing that's why Heather wasn't fired instantly, even though that's bloomin' obviously a bad solution, because it never should have happened in the first place.
Admin
At least Heather didn't redirect the backup BOFH style to /dev/null to accelerate the process.
@Frz: the consultant wasn't labelled "unhelpful", it just happened all the help he could offer was indeed very little. I'd agree that "offering [little help]" is not the word here, though.
@JustSomeDudette: agreed. Plus, a backup loses its backup status the moment it can be accessed the same time and/or from the same machine as the original files. So there is the additional WTF they didn't have a second independent backup medium.
Admin
More WTF's that I see
The main server only has 1 hdd for storage with no raid?
A file server is being backed up as one big image file? (Maybe not a full HDD VM backup.) And if so, the VM image storage system they have sucks.
And not DFS that is copied live (well, you need to buy more Windows to do that). There are also other ways to do a file-by-file backup.
Admin
No, it was RAID 1+0: striped across an array of mirrors, N drives per mirror. So long as you don't lose more than N-1 drives on any one mirror, you can recover (hopefully before the Nth drive gives up the ghost, because if N-1 are gone, I'd put money on the Nth being close to end of life as well).
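A minimal sketch of that survival rule, assuming each mirror holds the same number of drives. The function name and the per-mirror failure counts are made up for illustration; this models only the "at most N-1 failures per mirror" condition, not any real RAID controller.

```python
def raid10_survives(failed_per_mirror, drives_per_mirror):
    """Return True if every mirror still has at least one working drive.

    failed_per_mirror: number of failed drives in each mirror (hypothetical input).
    drives_per_mirror: N, the number of drives in each mirror.
    """
    # The stripe is lost as soon as any single mirror has lost all N drives.
    return all(failed < drives_per_mirror for failed in failed_per_mirror)


# Three 2-drive mirrors: one failure in two different mirrors is recoverable,
# two failures in the same mirror is not.
print(raid10_survives([1, 0, 1], drives_per_mirror=2))
print(raid10_survives([2, 0, 0], drives_per_mirror=2))
```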
Admin
The backup server had RAID but the main server only had one HDD?
Admin
Options of first thing to do:
talk to person who oversees backups
spend thousands of dollars hiring a consultant to confirm what you yourself already saw in the history of the filesystem where backups are stored (despite no other file having this issue)
Are people retarded or something? Why would you do it in that order? This head of IT just cost the company thousands of dollars because they didn't communicate with their fucking staff before hiring outside the company. What is wrong with you!?
Admin
Nowhere is there any mention of a regular restore of the backups. If you're not including automated restores (with proper checksum verification) on a regular basis you're only testing half the process.
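The checksum half of that can be sketched roughly like this, assuming the restore has already been written somewhere on disk and the original is still readable. The helper names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a 100 GB image never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True only if the restored file is byte-for-byte identical."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

A scheduled job could call `restore_matches` after each test restore and alert on `False`, or on the restored file being missing altogether.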
Admin
Of course, then you need to run a test to verify the automated restore is restoring the right thing, by backing it up and running a test to restore the backup of the test restore...
Filed under: "Yo, Dawg, I heard you like big stacks of turtles"
Admin
Very true. Although in this case, a test restoration wouldn't have helped either (unless they did it at 3am, while the backup was running).
The fact that this is a business and nobody had even considered the need for rolling / temporal backups is a true WTF.
Admin
DOS and ZFS?
Admin
I definitely agree--if you can't see how stupid this is you shouldn't be managing backups!
I normally do a write-new-and-then-replace when saving simple working files, let alone backups.
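For anyone who hasn't seen the pattern, a minimal sketch of write-new-and-then-replace, assuming the temporary file lives in the same directory (and so the same filesystem) as the target. The function name is made up; the point is that the old copy is never deleted before the new one is completely written.

```python
import os
import tempfile


def write_then_replace(target_path, data):
    """Write data to a temp file next to target_path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the new copy is on disk first
        # Only now does the old file go away; if anything above failed,
        # the previous version is still intact.
        os.replace(temp_path, target_path)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

If the write fails halfway (say, the disk fills up), the exception propagates and the previous file is untouched, which is exactly what a delete-first script cannot promise.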
Admin
Agreed. She should have been fired. From a cannon.
Addendum 2016-07-18 13:49: "I'm not sure you should resist there, someone who does that without noticing the problem doesn't belong near critical systems or backups."
Admin
Given the lack of relevant details in the story, there is some benefit of the doubt to give. I was once asked to 'set up the nightly backups' and, being very young and green, assumed to begin with that the weekly and monthly backups were being done separately because I hadn't been asked to sort those out too. Turned out there weren't any, of course, which bounced the priority of the nightly jobs up from 'don't lose a day's admin and logs' (an inconvenience, nothing more) to 'don't lose the entire business'.
I learnt to check more carefully how important things really are, but it wasn't my fault that I was given stupid instructions. To put that in the context of the WTF above, maybe the low-level worker was making a decision about an acceptable level of risk based on faulty data about the importance of the backups.
Admin
Heh reminds me of a gig I did 5 years ago. I could tell they were in a similar issue because their systems were not backed up for 2 weeks on a failing SAN which was about to fill up. I noticed their 'delete' script would run first then backup. The backup failed due to not enough space even after the delete ran, thus deleting the backup and not having room for more.
That same week we had an automated test restore process and server created and differential backups implemented. This stuff happens quite commonly in the wild.
Admin
Big issue: Why didn't anybody review the backup scripts before they went production? First question would be "You do WHAT first?" followed by disbelieving laughter.
Admin
You can't just fire someone on the spot in some states. In California that could end you up with a lawsuit. You can ask them to leave for the day, and fire them the next, but you have to have their full final paycheck ready for them, and you don't know how long your HR group is going to take for that. Usually better to hold your tongue and deal with it immediately afterwards.
Then again, maybe this person was valuable and just needed a little coaching as to why that idea was a tremendous WTF.
Admin
My initial impulse is yes, fire the problem-causer. But when she mentions disk space, I kinda wonder if she wrote those scripts to fix a past problem where there were failed backups due to insufficient space.
Admin
If you fire that employee you've just paid a large sum for your competitors to gain an employee who just learnt a valuable lesson.
Admin
"You can't just fire someone on the spot in some states. In California that could end you up with a lawsuit."
Right, but you can fire a consultant on the spot because the consultant diagnosed the problem 100% correctly.
In some countries you can fire an employee on the spot for the same reason.
Admin
When I get called at 3 AM, it's because a monitor has gone red or because a batch has failed, never because Accounting, Design, or the CEO are missing anything. They are all asleep.
Admin
So many fails here, mostly by "Leslie".
First, putting 100 GB files on a remote server is silly. The only reason you'd make a giant tar file like that is if you were writing to tape. Use a proper backup system that de-duplicates. There are free ones such as BackupPC.
No, no, no. Either it was RAIDZ1, or they put ZFS on top of hardware RAID, which is beyond idiotic.
Confirmed, "Leslie" is a complete moron and doesn't know that "scary text on screen" != DOS.
There is nothing anyone could do with ZFS at the Windows command prompt, and if they directly logged into the console of the ZFS server, it would have been BSD/Linux/Solaris, not DOS. Likewise, an SSH session != DOS.
ZFS was in no way part of the problem. In fact if they had been using ZFS correctly and had automated snapshots, then the deleted file could have been recovered.
Admin
You must not have worked at a large corporation. While the CEO is asleep at 3AM the week before the conference, his team (or teams) are still awake working on the presentation deck and more.
Admin
Maybe someone can tell me why all Erik's stories involve Incompetent IT Women having problems with trivial tasks, while all the men tend to either have serious diseases or are described as incompetent despite, most of the time, doing nothing wrong?
I mean if you're in IT and something goes wrong you instantly assume the company who delivers one basic service must have screwed up instead of any of your bash script girls wtf.
This story has smells, many smells.
Admin
.. or should you fire the person who should have been supervising her and checking the results of her work?
Uhm... 100Gb is not huge.
Admin
Because Gern is a troll and all of his stories are (poor) works of fiction and falsehood.
Admin
I don't understand how the consultant got cast in the role of bumbling expert in this story when he was correct on all fronts.
And maybe it should have been obvious to ask the person in charge of the backups what the backup process is before "spending thousands" on a consultant?
Admin
"Confirmed, "Leslie" is a complete moron and doesn't know that "scary text on screen" != DOS."
The danger in calling someone a moron is that you may be correct about the existence of a moron, but incorrect about the identity of the moron. The way I read it is that the consultant was complaining under his breath about having to use DOS.
Admin
I wonder if Spinrite would have recovered the failed drive. I didn't catch whether it was a hardware failure. If not, it could have worked!
Admin
TRWTF is that they only keep 1 copy of the backup per server. It's not surprising that when something bad happens to the server, like a virus infecting files or ransomware encrypting files, the latest backup will be of little use.
Btw, a bit surprised that no story about ransomware is told here (or there is but I'm not aware of it), given there are a few stories related to ransomware on other MIS-related forums already.
Admin
Yeah, it's stupid. The file was deleted, and that's what his conclusion was.
If you're familiar with the author you'll know exactly why he was written that way though.
Admin
With ZFS, you can have (very cheap) snapshots every five seconds.
And you can easily send the snapshots to remote location...
Admin
"The consultant was still paid, despite offering little help. His invoice led to upper management reconsidering ZFS for their remote backup solution." They hired a specialist, wasted his time and it's his fault for not being able to help? As I read it, the company failed on multiple levels. The specialist did his job, was spot on with his diagnosis of the problem, and deserves to get paid for his time more than both Heather and Leslie.
Admin
It was a screw-up, but she seems to have realized the problem only after it occurred... Happens a lot in programming. In addition, the fact that nobody else reviewed the backup solution is probably a bigger WTF.
Admin
reminds me of an old story: a novice worker misinterpreted the message "this disk must be formatted before use", so EVERY time he inserted a floppy disk into ANY drive for ANY reason, the VERY first thing he ALWAYS did was to reformat it...and his job was to do the daily backups...and nobody noticed the problem until he was told to restore from a backup...