Keepin' It Cool was originally published on October 4, 2006

A few years ago, Phil was working as a developer on a wire transfer application at a large bank. To make sure that nothing technical would prevent the bank from extracting maximum amounts of money from its operations, every part of its system had redundancy, with fast failovers and clustering. In fact, there was even one server (and a backup of that server) whose only function was to monitor the other servers and send notifications if anything fell outside the operational norm.

When a system or process failed, the monitoring server would page the on-call support administrator, who would then log in and restore the errant system to its rightful state. On rare occasions, an actual visit to the server room was required.

One summer day, at about two in the morning, the on-call administrator (Mark) was awakened by a Critical Notification Alert that a couple of core processes -- such as the ACH Batch Script and General Ledger Job -- had crashed. Moments later, he was paged again, this time with a notification that the failover processes had failed as well.

As Mark attempted to log in to the process control cluster, he received another Critical Notification Alert. And then another. And then several more. The remote access server wasn't responding to his log-in attempts, so Mark got dressed and headed downtown.

About halfway to work, his pager stopped receiving alerts altogether. That was a pretty bad sign, so he dialed into the office to access the company directory and get the numbers of secondary on-call administrators. But the phone lines were dead, too, which could only mean one thing: a bomb, a fire, or a giant robot wreaking havoc throughout the city.

When Mark arrived at work, things were very quiet. He nodded at the security guard and took the elevator up to the server room floor. As he approached the server room, he saw that the automatic secure doors were propped open. Entering the room, Mark felt a blast of heat and noticed two maintenance employees, both sweating profusely, working on the air conditioning units. This seemed a little strange, since Mark should have been the one to call the maintenance crew out there, so he asked how they knew that the air conditioner had failed.

HVAC Guy: It didn't fail. We're just changing the chiller bars and doing some other preventive maintenance.
Mark: But we've got two air conditioners in here, why are they both down?
HVAC Guy: We figured it'd be easier to do them both at the same time. Why, is there a problem with that?

Mark and his team of network administrators spent the next 36 hours or so rebuilding and restoring each server, its backup, and its backup's backup. And as for Phil, the developer who submitted this story, he got the day off.
 
