
It was a lazy, drowsy Saturday afternoon. The sun was shining, birds were singing. The kind of day when children should be playing outside, perhaps running bases in a sandlot someplace, carefree and smiling. Even indoors, thanks to the cost-saving measures at Big Online Retail Store™ HQ, it was warm enough to send tantalizing daydreams of comfortable naps in soft places to the employees working the weekend shift.

Production code pushes, of course, were anything but lazy. There were checks and balances, and the checks and balances had checks and balances. There was tension, and urgency, and the stakes were clear to all involved: don't you dare make a typo or you'll bring the whole company down. Most of the system was automated, and the rest of it was scripted by the developers who nurtured the system like a fussy toddler, willing to cater to its whims if it would just stop crying and let them get some sleep.

On the wall of the Ops War Room, a bright red digital counter ticked down the days: 13 days until Black Friday. Printed memes were tacked to the walls around it, promising dire consequences for screwing up the deployments now. The developers working on hardening the systems were haggard, bleary, frayed around the edges, and, at the time of this story, home sleeping off the previous night's deployment.

But there was a second system at Big Online Retail Store, one that had nowhere near the same oversight: the internal systems that monitored network traffic to the site and pulled out analyses from it. While the bean-counters were counting on the information it would provide during the Christmas season, the lead-up was so far more of a lazy river rafting ride, gently drifting toward the moment when the passengers would disembark and once more have to move under their own power. These systems didn't have teams of people dedicated to every aspect of their existence. Instead, a few developers maintained the dev and production environments, managing the servers themselves, the last vestiges of the maverick mentality that had gotten Big Online Retail Store this far.

On this particular Saturday, our hero Ashton was given the task of commissioning five new servers for the distributed network. They had been provisioned in a data center, shiny and new and ready for production; the system was architected as a series of small components that could be hosted on any machine, so he had about 20 machines to reconfigure in order to spread out the load evenly.

The building was nearly empty at this time of day, making it perfect for zoning out with some monotonous work and some nice, laid-back music. Ashton had definitely had worse Saturdays.

The routine was pretty straightforward:

  • ssh into server
  • check component status; copy in if necessary
  • check configurations
  • check for normal process start up to isolate config issues
  • fix config issues
  • check cron and add any monitoring scripts if needed
  • sudo to root using a new shell
  • open /etc/inittab
  • add the commands to start up the newly installed components
  • drop the ones being shifted elsewhere
  • telinit q
  • hop on to the next box

Fifteen servers in, Ashton's mind was well and truly out the window. Is it warm enough that there'll be a line at the ice cream place? he pondered.

sudo to root.

Probably. Maybe the deli won't be too bad though.

Paste the new configuration lines.

Hmm, deli ... pickles ... would I rather have fried pickles than ice cream?

dd to remove the old lines.

Definitely fried pickles. All right, I'll finish this up, get some lunch, head home.

telinit 1

Maybe there's still time to wash my car before—hang on a minute.

Ashton had made a horrible mistake.

On a standard QWERTY keyboard, the q key is a hair's breadth away from the 1 key, and Ashton had hit the wrong one. While telinit q would simply instruct the init service to reload its configuration file, telinit 1 would drop the entire system down to runlevel 1: single-user mode, in which certain superfluous services are stopped and only one user at a time can run programs. Superfluous services like, for example, networking. You didn't need networking, after all, as long as you were sitting in front of the box, typing away at an attached keyboard. As long as you weren't doing something stupid, like controlling the server from your desk via PuTTY while daydreaming about pickles.
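For readers who would rather not repeat Ashton's afternoon, here is a minimal sketch of the kind of guard that might have caught the slip. It is not anything Big Online Retail Store actually ran: the function name classify_telinit is hypothetical, and it assumes the session was opened through OpenSSH, which sets the SSH_CONNECTION environment variable for remote logins. The function only classifies the request; it never calls telinit itself.

```shell
#!/bin/sh
# Hypothetical guard (not from the story): classify a telinit argument
# before anyone is allowed to run it.
#   $1 = the argument that would be passed to telinit
#   $2 = the value of SSH_CONNECTION (empty string means a local console)
# Prints one of: reload, blocked, runlevel-change.
classify_telinit() {
    arg="$1"
    ssh_conn="$2"
    case "$arg" in
        q|Q)
            # q/Q only asks init to re-read /etc/inittab; the runlevel stays put.
            echo "reload"
            ;;
        *)
            if [ -n "$ssh_conn" ]; then
                # Anything else changes the runlevel, which can take down
                # networking and cut off the very session issuing the command.
                echo "blocked"
            else
                # On a local console, a runlevel change is at least survivable.
                echo "runlevel-change"
            fi
            ;;
    esac
}
```

A wrapper script could call classify_telinit "$1" "$SSH_CONNECTION" and refuse to proceed on "blocked". One caveat: depending on how sudo is configured, SSH_CONNECTION may not survive the hop to root, so the value would need to be captured before switching users.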

Unable to reconnect, Ashton went straight to the Network On-Call person, Tori, messaging him via IM to explain the situation.

"OK, one sec, I'll bring it back up," Tori replied, and Ashton breathed a sigh of relief.

But not for very long. A moment later, Tori came back to the IM: "Hey, where's the bleedin' server? Did you restart it?"

Ashton felt his heart sink. Clearly he hadn't explained well enough. "Um ... I told ya ... I dropped it to run level 1."

"OK..... what was the machine name again?" Tori responded, a few excruciating moments later.

Ashton told him.

What followed was the stuff of legend. We at The Daily WTF are ever conscious of our public availability, and as such, have a moral duty not to repeat the precise text Tori sent in reply to Ashton, as expletives of that sort have been recently classified in some states as class-3 weapons of mass destruction. The string filled Ashton's screen, the verbal equivalent of a full ten-minute rant. It was the sort of rant that the Internet makes possible, wherein the aggrieved party, unhindered by a need to breathe, can go on and on, becoming more and more inventive with each suggestion.

Finally, the rant ceased, culminating in a final cry of dismay: "IT'S IN SALT LAKE CITY!"

Ashton didn't dare think about anything but his work as he finished the last four servers, keeping a nervous eye on Tori's "Do Not Disturb" icon the whole time. Finally, just as he was finishing up the last one, Tori messaged him again: "Called the datacenter. Took half an hour to find the box."

"Thank you!! I owe you!" Ashton replied, relieved, as he pinged the errant server and got a response.

"No. You owe Sam in Salt Lake. And you better believe he'll come collecting."

Ashton swallowed, closing out the IM window. Whatever the fallout might be, he knew he had no choice but to face it with grim determination.
