Leigh didn’t have anything to do with automating operations at their NOC, although he was mostly glad it had been done. The system was a bit of a mess, with home-grown programs and scripts sitting atop purchased monitoring packages and a CMDB. It was cumbersome and sometimes spat out incomprehensible, nonsensical errors, but it mostly worked, and it saved them a huge amount of time.
It was also critical to their operations. Without these tools, without the scripts and the custom database back end, without the intermediary applications and the nice little stop-light dashboard that the managers could see if they hit refresh five times, nothing could get done. Unfortunately, this utopia covered up a dark underbelly.
We’re going to need a meeting
The operations team were the end users of the software, but they mostly relied on the development team to build it and maintain it. The development team relied on the database team. Once, Leigh needed them to expand the size of a single text field in the database from 25 characters to 50 characters. The development team had no problem updating their applications, but the database team wasn’t ready to start changing column sizes right away. Burt, the head of the database team, had to start with a one-hour meeting with his entire team to discuss the implications. Then he had to have another meeting with the development and operations managers. Then Leigh needed to sit down with the DBAs and justify the extra 25 characters (“That’s a 100% increase in the size of the field!” Burt proclaimed). After 200 man-hours, the field was changed.
After the database had been in use for some time, performance started to degrade. Leigh helpfully suggested building an index around some of their frequently executed queries. This small pebble triggered an avalanche of meeting requests. Burt wasn’t about to waste a bunch of hard disk space on an index just because his customer, the developers, and his own DBAs said so. Over thirty meetings were held before Burt grudgingly agreed to allow one index on a few key fields.
The worst day, however, was the day the database went down, at 8 AM on the last Friday of the month. Everything the operations team did ground to a complete and total halt. This had cascading effects down the line to application after application, especially the payroll process, which needed to ship a gigantic flat file to the payroll company so that employees would get their checks that month. Every cellphone in the building started beeping out the clarion call of complete disaster as emergency alert after emergency alert went out. The problem went from “solvable crisis” to “full meltdown panic my hair is on fire” in the space of 15 minutes.
Leigh, keeping his cool, started calling DBAs. Not a one answered their phone. Leigh called Burt directly, again with no answer. Leigh’s boss called Burt’s boss, who swore Burt was available, and that he would be found. Leigh texted, emailed, IMed, and stalked the halls, hoping to find one lone DBA that could look at the database. At the end of the first hour, every middle manager in the building was in the hunt. At the end of the second hour, most of upper management had joined them.
The database remained stubbornly down, and with it, most of the network operations.
And then the DBAs reappeared on IM. The underlying problem was a full transaction log, which was solved in moments with a log rotation. Operations resumed. Leigh, however, was curious. He pinged Sally, one of the DBAs, and asked where they had been.
“When the database went down,” Sally replied, “Burt called an emergency meeting. He locked us all in a conference room and didn’t let anyone leave until we had a ‘plan of action’. We weren’t even allowed bathroom breaks!”