It was not the ideal way to start a Monday morning. Matt arrived at work to find his boss frantically pacing around the office. “Oh thank God you’re here,” he said as they locked eyes, “the CMB system is down. And Net Ops can’t bring it back online.”
On the scale of All Things Bad, a downed CMB system falls somewhere between having all employees call in sick for the day and the spontaneous combustion of all computers throughout the company. No CMB meant that the company’s 600+ employees would have to rely on “manual processes” to do their jobs. And that meant that there’d be a lot of unhappy employees, managers, and customers.
Matt headed down to the server room to see what the problem was. There he met Fred, one of the contractors from their outsourced Network Operations partner, Lowest Bidder, Inc. “So from what I can tell,” Fred explained, “there was a power outage over the weekend.”
“So,” Matt said in a questioning tone, “why was that a problem? We have a giant UPS and a backup generator.”
“Sure we do,” Fred said, “but the outage lasted longer than the eight hours the UPS was rated for, and no one was here on Sunday to turn on the generator.”
“But we have a whole alerts system for that! Didn’t anyone receive a call or a page?”
“Dunno,” Fred shrugged, “it wasn’t my turn to carry the pager. I just found out about it this morning when I noticed the Exchange server was down.”
Fred told Matt that another Net Ops jockey, Phil, was currently trying to bring back the CMB system. Matt was starting to get worried. The CMB system was housed on a Solaris server, and Solaris does not take kindly to having the lights go out unexpectedly.
Matt headed over to the CMB servers to chat with Phil. “Any luck getting CMB back online?”
“Well,” Phil started, “I tried to bring the Suns back up, but I have no idea what’s going on. Some of the drive lights are on. See, this one’s blinking. But this one … dunno. And nothing I do gets it—”
“No! Phil!” Matt interrupted, “No! Don’t!” It was too late. Phil had reached toward the front of the Sun computer’s grille and cycled the power. Visions of a corrupt file system floated in front of Matt as he pleaded with Phil to never touch the big black button on the expensive purple hardware ever again.
“Okay,” Matt said calmly, “we need to get into the ALOM system console. I’m sure there’s some low-level disk check we could use before booting Solaris.”
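(For the curious: on Sun Fire-class hardware, the recovery path Matt had in mind goes roughly like this. The session below is a sketch, not a transcript; exact prompts and options vary by model and firmware, and the disk slice shown is only an example.)

    sc> console -f                   (attach to the host console from the ALOM prompt)
    sc> break -y                     (drop the host to the OpenBoot "ok" prompt)
    ok probe-scsi-all                (confirm the disks are actually visible to the firmware)
    ok boot -s                       (boot to single-user mode instead of a full multi-user boot)
    # fsck -y /dev/rdsk/c0t0d0s0     (check and repair the root slice before going multi-user)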
“Crap,” Phil exclaimed, “that could be a problem. We lost the system console password. There’s probably an override… but, crap! There’s another problem. We lost the manual, too.”
Exasperated, Phil had no idea what to do. He just stared blankly at the apparently dead, purple box that housed his company’s critical information system. And then he noticed something: the drive light wasn’t dead. It was blinking fast. Really, really fast. There was so much disk activity going on that the status light merely appeared dim. All that apparent inactivity was, in fact, Solaris running a horribly long boot. That pesky fsck in the boot process was taking forever.
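(As an aside, a marathon fsck is the classic symptom of an unclean shutdown on UFS without logging: with no journal, fsck has to walk the entire file system to verify its metadata, while a logging UFS simply replays the log at mount time and a multi-hour check becomes a few seconds. Assuming the box ran an older Solaris release where logging wasn’t yet the default, turning it on is a one-word change to the mount options in /etc/vfstab; the device and mount point below are illustrative.)

    #device to mount    device to fsck       mount point    FS type  fsck pass  mount at boot  mount options
    /dev/dsk/c0t0d0s7   /dev/rdsk/c0t0d0s7   /export/home   ufs      2          yes            logging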
Several hours and a few index repairs later, the CMB system finally came back online. Programs were running, users were productive, daemons were … daemoning, and all was well.
But Matt was still unsatisfied with Network Operations’ response to the power outage: someone should have been paged to fix the problem. After a quick investigation, Matt discovered that the alerts server was in fact hooked up to the UPS and did try to call just about everyone on the I.T. call tree. It’s just too bad that the fine folks at Lowest Bidder, Inc. didn’t think to plug the PBX system that ran the phones into the UPS.