Skill which you don’t use regularly can get rusty. It might not take too much to get the rust off, and remind yourself of what you’re supposed to be doing, but the process of remembering what you’re supposed to do can get a little… damaging.

Lesli spent a big chunk of her career doing IT for an insurance company. They were a conservative company in a conservative industry, which meant they were still rolling out new mainframes in the early 2000s. “Big iron” was the future for insurance.

Until it wasn’t, of course. Lesli was one of the “x86 kids”, part of the team that started with desktop support and migrated into running important services on commodity hardware.

The “big iron” mainframe folks, led by Erwin, watched the process with bemusement. Erwin had joined the company back when they had installed their first S/370 mainframe, and had a low opinion of the direction the future was taking. Watching the “x86 kids” struggle with managing growing storage needs gave him a sense of vindication, as the mainframe never had that problem.

The early x86 rollouts started in 2003, and just used internal disks. At first, only the mail server had anything as fancy as a SCSI RAID array. But as time wore on, the storage needs got harder to manage, and eventually the “x86 kids” rolled out a SAN.

The company bought a second-hand disk array and an expensive support contract with the vendor. It was stuffed with 160GB disks, RAIDed together into about 3TB of storage- a generous amount for 2004. Gradually every service moved onto the SAN, starting with file servers and moving on to email and even experiments with virtualization.

Erwin just watched, and occasionally commented about how they’d solved that problem “on big iron” a generation ago.

Storage needs grew, and more disks got crammed into the array. More disks meant more chances for failures, and each time a disk died, the vendor needed to send out a support tech to replace it. That wasn’t so bad when it was once a quarter, but when disks needed to be replaced twice a month, the hassle of getting a tech on-site, through the multiple layers of security, and into the server room became a burden.

“Hey,” Lesli’s boss suggested, circa late 2005. “Why don’t we just do it ourselves? They can just courier over the new drives, and we can swap and initialize the disk ourselves.”

Everyone liked that idea. After a quick round of training and confirmation that it was safe, that became the process. The support contract was updated, and this became the process.

Until 2009. The world had changed, and Erwin’s beloved “big iron” was declining in relevance. Many of his peers had retired, but he planned to stick it out for a few more years. As the company retired the last mainframe, they needed to reorganize IT, and that meant all the mainframe operators were now going to be server admins. Erwin was put in charge of the storage array.

The good news was that everyone decided to be cautious. Management didn’t want to set Erwin up for failure. Erwin, who frequently wore both a belt and suspenders, didn’t want to take any risks. The support contract was being renegotiated, so the vendor wanted to make sure they looked good. Everyone was ready to make the transition successful.

The first time a disk failed under Erwin’s stewardship, the vendor sent a technician. While Erwin would do all the steps required, the technician was there to train and supervise.

It started well. “You’ll see a red light on the failed disk,” the technician said.

Erwin pointed at a red light. “Like this?”

“Yes, that exactly. Now you’ll need to replace that with the new one.”

Erwin didn’t move. “And I do that how? Let’s go step-by-step.”

The tech started to explain, but went too fast for Erwin’s tastes. Erwin stopped them, and forced them to slow it down. After each step, Erwin paused to confirm it was correct, and note down what, exactly, he had done.

This turned a normally quick process into a bit of a marathon. The marathon got longer, as the technician hadn’t done this for a few years, and was a bit fuzzy on a few of the steps for this specific array, and had to correct themselves- and Erwin had to update his notes. After what felt like too much time, they closed in on the last few steps.

“Okay,” the tech said, “so you pull up a web browser, go to the admin page. Now, login. Great, hit ‘re-initialize’.”

Erwin followed the steps. “It’s warning me about possible data loss, and wants me to confirm by typing in the word ‘yes’?”

“Yeah, sure, do that,” the tech said.

Erwin did.

The tech thought the work was done, but Erwin had more questions. Since the tech was here, Erwin was going to pick their brain. Which was good, because that meant the tech was still on site when every service failed. From the domain service to SharePoint, from the HR database to the actuarial modeling backend, everything which touched the SAN was dead.

“What happened,” Erwin demanded of the tech.

“I don’t know! Something else must have failed.”

Erwin grabbed the tech, Lesli, and the other admins into a conference room. The tech was certain it couldn’t be related to what they had done, so Erwin escalated to the vendor’s phone support. He bulled through the first tier, pointing out they already had a tech onsite, and got to one of the higher-up support reps.

Erwin pulled out his notes, and in detail, recounted every step he had performed. “Finally, I clicked re-initialize.”

“Oh no!” the support rep said. “You don’t want to do that. You want to initialize the disk, not re-initialize. That re-inits the whole array. That’s why there’s a confirmation step, where you have to type ‘yes’.”

“The on-site tech told me to do exactly that.”

The on-site tech experience what must have been the most uncomfortable silence of their career.

“Oh, well, I’m sorry to hear that,” the support rep said. “That deletes all the header information on the array. The data’s still technically on the disks, but there’s no way to get at it. You’ll need to finish formatting and then recover from backup. And ah… can you take me off speaker and put the on-site tech on the line?”

Erwin handed the phone over to the tech, then rounded up the admins. They were going to have a long day ahead getting the disaster fixed. No one was in the room to hear what the support rep said to the tech. When it was over, the tech scrambled out of the office like the building was on fire, never to be heard from again.

In their defense, however, it had been a few years since they’d done the process themselves. They were a bit rusty.

Speaking of rusty, while Erwin continued to praise his “big iron” as being in every way superior to this newfangled nonsense, he stuck around for a few more years. In that time, he proved that he might never be the fastest admin, but he was the most diligent, cautious, and responsible.

[Advertisement] Keep the plebs out of prod. Restrict NuGet feed privileges with ProGet. Learn more.