James works at one of the few companies where maintaining a high-reliability environment is important. Actually, it's more than important: it's required by law. The company is a natural gas supplier and needs to ensure that its Problem Alert & Monitoring (PAM) system is always running. And for good reason: PAM is responsible for coordinating, dispatching, and directing emergency crews to problems like gas leaks and pipeline explosions.

Creating an environment that systems like PAM require is not an easy task. First, there's data backup, both on- and off-site. Then there's the fail-over/clustered server. And the diesel generator. And the backup generator. And, of course, around-the-clock, on-site staff to monitor everything. But their PAM didn't live in such an environment. James' company had a slightly different strategy for reliability: a thirty-minute UPS and a lot of faith.

One of the many quirks about their PAM system is that it doesn't like being shut down, especially unexpectedly. It uses a completely proprietary database backend that needs to "pack up" the data before it can restart. Though the "pack up" takes only a few minutes to run, skipping it means that a six-to-eight hour database rebuild will be needed when it starts up again. No, it's certainly not ideal -- even the manufacturer admits to that -- but it's designed to always be on.

Amazingly, the thirty-minute UPS was able to keep PAM running for two years straight. And then there was that extra-warm summer day that led to a forty-minute power outage. PAM went down. Hard. No one was there to "pack up" the database and it took a full eight hours to come back online.

News of the PAM outage made it all the way up to executive management. And they weren't happy. Meetings were held and immediate action was taken: the UPS was plugged into a server that would send out alerts to James' team to perform a clean shutdown. Sure, they could have followed James' advice and installed a gas-turbine generator, but they figured thirty minutes would be plenty of time for someone to sign on and shut down the system.

A whole year went by without incident. There were some power outages here and there, but they never lasted for more than a minute. And then James received The Page. PAM was in danger of going off-line and James had twenty-eight minutes to remotely sign on and shut down the server. There was only one problem: the router wasn't plugged into the UPS and James was at least forty minutes away. PAM went down again. Hard. It took nearly eight hours before they were back online.

Management did not take the news well. If regulators were to find out, they could face serious fines (as in, millions of dollars) and risk being shut down completely. More meetings were held and serious action was to be taken: the router would be put on a dedicated UPS and PAM would be given a beefier, two-hour UPS.

A few weeks passed and the day of the UPS upgrade finally came. The plan was simple: James would sign on, shut the system down, and inform the technician that he could do the upgrade. All in all, PAM would be down for no more than twenty minutes.

Except he couldn't sign on. He couldn't perform the clean shutdown. The UPS technician had arrived an hour ahead of schedule and management figured that they might as well get started early. And eight hours later, PAM was finally back online.
