Leslie, head of IT at BlueBox, knew there was trouble when one of her underlings called her at 3AM. “The shared server’s down,” the underling said. “Disk failure. Accounting can’t issue invoices, design can’t get to its prototypes, and the CEO just lost his PowerPoint for next week’s conference speech.”
BlueBox, like many companies, kept many important documents on a shared server. It also held personal directories for every employee, and many (like the CEO) used it to store personal files. That data, totaling 100 GB, was backed up to a remote server every 24 hours.
“Okay, swap out the disk and restore it,” Leslie said.
“I can’t find the backup,” the underling replied.
Leslie groaned, then rolled out of bed, booted her laptop, and RDPed into the remote server. The blood drained from her face: while there were backups of every other server that BlueBox needed to operate, the shared server’s was missing.
Bracing for the headache she would face at the office, Leslie made a call to a data recovery specialist. Later that morning, while the shared docs were being salvaged from the failed disk, Leslie prepped for the postmortem.
The Consultant
The remote server held eight 1TB HDDs in RAID 1+0, formatted with ZFS. With that robust configuration, it probably wasn’t a hardware issue that caused the backup to disappear. It clearly had to be something wrong with the file system.
Naturally, a ZFS consultant was hired.
“I just don’t see how it’s possible for a 100GB file to ‘disappear.’” The consultant addressed Leslie and the rest of IT, who were seated in the conference room. He made air quotes with his fingers. “ZFS uses copy-on-write transactions. While a file is getting rewritten, the old file data remains on-disk until the operation is completed. If there were a hardware failure during that time, the file system would fall back to the old file data. It wouldn’t ‘disappear.’”
“We’re paying you a lot of money,” Leslie said. “Why don’t you see for yourself?”
A laptop was brought in with an open connection to the server. The consultant grimaced as he opened the DOS command prompt, muttering something about Bash, then ran several commands to check the integrity of the file system. As he worked, his mouth fell agape, cheeks twitching. “No, it’s not possible… This is a fresh file. Are you sure the file wasn’t, well … deleted?”
Leslie sighed. “Thank you for your time. Security will show you out.”
Just Saving Space
After spending thousands on a dead end, Leslie decided to start with the basics, interviewing every member of IT about the day in question. After grilling several employees on her team, she called in Heather, who oversaw their backup solution.
“There’s a scheduled task to perform the backup on the shared server,” Heather began. “I have it timed for 3AM.”
“That’s close to when the backup failed. Does the scheduled task run a batch script?”
“Yeah.” Heather opened the script on her laptop and showed her.
Leslie’s stomach dropped. “Line 12 … you delete the old backup before creating a new one?”
“I always delete the last backup before I do the next backup,” Heather said. “It helps save space and keeps the hardware optimized. All the other servers are set up that way.”
It was all Leslie could do to keep herself from firing Heather on the spot.
The Solution
Leslie watched as Heather rewrote every backup batch script line by line. Seven previous backups would be kept, with new ones written every 24 hours, and old backups would be deleted only after the most recent backup was written. The consultant was still paid, despite offering little help. His invoice led to upper management reconsidering ZFS for their remote backup solution.
A few days afterwards, Leslie got an unexpected visitor. The CEO of BlueBox, effusive with praise, thanked her for finding his PowerPoint before the conference began. He offered a substantial bonus.
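For the curious, the rotation policy Heather ended up with might look something like the sketch below. This is not BlueBox's actual batch script, which the story never shows. It is a minimal Python illustration, and the function name `run_backup`, the `backup-`/`partial-` directory prefixes, and the `stamp` parameter are all made up for the example. The point is the ordering: the new backup is fully written and renamed into place before anything old is deleted.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

KEEP = 7  # how many backups to retain, as in the story


def run_backup(source: Path, backup_root: Path, keep: int = KEEP,
               stamp: Optional[str] = None) -> Path:
    """Write a new backup first, then prune old ones only if that succeeded."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_root.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: timestamped names sort chronologically.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    partial = backup_root / f"partial-{stamp}"
    final = backup_root / f"backup-{stamp}"

    # Copy into a "partial" directory. If this raises (disk full, source
    # missing), we never reach the pruning step, so old backups survive.
    shutil.copytree(source, partial)

    # Only a completed copy gets the "backup-" name.
    partial.rename(final)

    # Now, and only now, delete everything older than the newest `keep`.
    backups = sorted(p for p in backup_root.iterdir()
                     if p.name.startswith("backup-"))
    for old in backups[:-keep]:
        shutil.rmtree(old)
    return final
```

Had the original script followed this order, a disk failure at 3AM would have cost at most the backup in progress, leaving the previous night's copy intact.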
Leslie handed the CEO a business card. It had the contact info for the data recovery specialist who salvaged the PowerPoint file from the failed disk. “You ought to give him one, too,” Leslie said, “since he saved your presentation.”