Working freelance for a small, three-person company named StruggleCo, Ivey saw firsthand how they struggled to make do. Part of this involved understanding StruggleCo's terminology for things. For example, though he was retained on an hourly, as-needed basis, Ivey was StruggleCo's Senior Network Engineer. Whenever Gary, the owner, needed to "consult with legal", that actually meant Googling for boiler-plate legal documents. And when Gary said "data center with offsite backup", he really meant "the broom closet and basement."
At least, that had been the case for as long as Iven could remember. But after been fighting a long, uphill battle against some larger competitors, StruggleCo finally landed a huge contract (for them) with a medium-sized company we'll call MediumCo. And this time, they would do things right.
While explaining the MediumCo contact to Ivey, when Gary said "datacenter," he actually meant datacenter. And "server" actually meant server, not some old, cobbled-together workstation running Linux. All Ivey had to do was set-up the server and get their software running on it.
Armed with a nice new 1U system from HP with mirrored 40GB drives, Ivey made the trek out to the facility, which was about two hours away. Installation was a snap, and the automatic update service running on the Linux box was working beautifully. All told, it took less than an hour for the installation and configuration. Ivey set up an automatic backup to send the files to a box at the offsite facility. That is, Gary's basement. After confirming everything was working, Ivey made the trip back home.
The Datacenter and Offsite Backup
Over time, Ivey had all but forgotten about the setup. A year had passed without a word from StruggleCo, until Ivey got a call. It was Gary. "Ivey, we're having big network problems. I need you to go to the hosting facility to investigate. MediumCo's users can't get on the machine."
Connected via VNC, Ivey immediately discovered that there were a bunch of connections that weren't closing as expected. The automatic updates were fine, except for a queue of kernel updates that were downloaded but not installed. This was expected, as those required a reboot, but it was clear that they were the source of the TCP/IP stack errors. From the comfort of his computer chair in his den, he rebooted the machine... only the machine didn't come back online. Wonderful, the bootloader had somehow broken.
A two hour drive later, and he was in the server room, freezing his ass off, trying to figure out what the hell was happening with the machine. After a couple of hours of rebooting, booting from CDs, running grub-install to get the bootloader going again, and banging his face against the keyboard, he had finally gotten things back up and apparently stable. He learned one valuable lesson here – to wait until the weekend/non-peak hours to reboot in case there are errors preventing the machine from coming back online quickly.
Violating a Secret Agreement
Satisfied with a job well done, he sat in the datacenter a while longer to monitor the machine. During this time, he received an email from Gary, forwarded from a VP at MediumCo.
From: Gary <[email protected]> To: Ivey <[email protected]> Ivey, what do you have to say about this? --- Forwarded message --- Our site has been unavailable for several hours today! This is in direct violation of our contract, which requires you to provide 99.99% uptime! I will be discussing this with legal, and I do expect this resolved immediately!
Unbeknownst to Ivey, the contract between StruggleCo and MediumCo promised 99.99% uptime. That is, a little under an hour that the server could be offline each year. That time, as well as the better part of the downtime allowed for the decade was used up that day. Ivey's protests that a promise of 99.99% uptime doesn't make sense with one person working support, hourly, that lives two hours from the datacenter might not be a great idea fell on deaf ears.
"I mean, why not just say 99.9% uptime? That still sounds impressive but gives us a lot more breathing room," Ivey added, aware that it was too late now.
Gary scoffed. "It doesn't matter, ninety-nine point nine, ninety-nine point nine nine, there's such a little difference that it's moot." This, of course, wasn't true. 99.9% gets you almost nine hours, 99.99 gives you less than an hour.
Gary continued down his thread of Gary-logic by throwing his hands up and complaining "well, we had great uptime. It was all working perfectly until now!"
Yeah, well, that's true. Things generally tend to work until... they stop working. if they had written that into the contract, they could've guaranteed 100% availability – that is, during times when the system is available. In fact, they could make that guarantee about every server and application ever created.
Unfortunately, when MediumCo said that they would be "discussing this with legal", they actually meant discussing the matter with their in-house counsel. Things kinda went downhill from there.