Before Curtis even got to sit down at his desk, he was accosted by a frenzied, sweating junior developer. "OhmygodCurtis," he began. Curtis extended his hand in a "calm the hell down" gesture and allowed him to continue. "A whole bunch of our stores had no data posted last night and I'm not sure why orwhat to doabout it or whoIshouldtalktoand-" Curtis gestured again, to which the developer handed him a thin stack of papers. After a deep breath, the developer continued. "It's a list of the stores that didn't post last night."
The stores in question were part of what we'll call Hewitt & Liberty Block – a reasonably large tax preparation company serving a handful of states with over 2,000 retail locations. Each of the locations was set up to post tax data and sales records to the central computer at the main facility. According to Curtis's list, nearly a quarter of them hadn't posted any data. It was going to be a long day.
OK, Curtis thought to himself, what would be the simplest possible explanation... some of the stores were closed! Except that he called a few and they were open as usual and had sales that should've posted the previous night. OK, so what's the second simplest possible explanation... the listener stopped working at some point during the night! Paging through the log file, however, there were no signs of the service failing. Third simplest possible explanation... I'm in my own personal hell. So far the best theory Curtis had.
The stores that had posted data seemed to have nothing differentiating them from the stores that hadn't. Neither how long a location had been open nor how close it was to a location that had posted successfully made any difference.
The change control staff repeatedly insisted that while yes, there had been some changes the previous night, they were all properly documented and had gone through the appropriate process. Specifically, a minor firewall change had been made, a database had been moved to a different server, and a few stale DNS entries for servers that no longer existed had been removed.
Curtis made a few more calls to make sure that the staff was able to get online where necessary, and strangely, all of the stores that hadn't posted any data were able to access the web, while the ones that did post data weren't. Curtis grilled the managers at several of the locations, asking whether anyone had been there to inspect the equipment, who the ISP was, and what connection type they used. Gradually a pattern emerged: all of the locations on his list used broadband.
So he'd found the common thread, but still, what the hell? Did the software have some kind of built-in mechanism to verify the upload speed and deny it if it was too fast? No, because it wasn't a problem yesterday. Could the "small firewall change" have affected this? Well, no, because most of the locations posted their data correctly, and all stores used the same port. Perhaps the stale DNS entries that got deleted?
In the change logging system, Curtis found the list of DNS entries that had been removed. With a quick search, he found a reference to one of the entries in a utility function. In pseudocode:
function ConnectionTypes DialUpOrBroadband() {
    var pingResult = util.ping(HQ_SERVER_NAME);
    if (pingResult == "Ping request could not find host...") {
        activateModem(); // activates the modem and connects to the internet
        return ConnectionTypes.Dialup;
    } else if (pingResult == "Request timed out") {
        return ConnectionTypes.Broadband;
    }
    throw someException;
}
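And it's no coincidence that there's no check for "Reply from x.x.x.x...": this function was, crazily, written to ping a server that had never even existed. The ping was always going to fail, and the way it failed was the whole test. With the DNS entry gone, a broadband store would presumably get "could not find host" instead of a timeout, and be treated as dial-up.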
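One plausible reading of why the deleted entry mattered can be sketched as a small Python simulation. Everything here is an assumption for illustration: `simulated_ping`, the `online` flag, and the `dns_entry_exists` flag are hypothetical stand-ins, and the model assumes that a dial-up store is offline until its modem dials while a broadband store is always online.

```python
from enum import Enum


class ConnectionType(Enum):
    DIALUP = "dialup"
    BROADBAND = "broadband"


# Message prefixes taken from the pseudocode; the first one is elided there.
NOT_FOUND = "Ping request could not find host"
TIMED_OUT = "Request timed out"


def simulated_ping(online: bool, dns_entry_exists: bool) -> str:
    """Hypothetical stand-in for util.ping against a host that never existed."""
    if not online or not dns_entry_exists:
        # The name cannot be resolved: no connection, or no DNS entry.
        return NOT_FOUND
    # The name resolves, but no server is there to answer.
    return TIMED_OUT


def dial_up_or_broadband(online: bool, dns_entry_exists: bool) -> ConnectionType:
    """Mirror the branch logic of the pseudocode (modem activation omitted)."""
    result = simulated_ping(online, dns_entry_exists)
    if result.startswith(NOT_FOUND):
        return ConnectionType.DIALUP
    if result == TIMED_OUT:
        return ConnectionType.BROADBAND
    raise RuntimeError("unexpected ping result: " + result)


# Broadband store before the cleanup: the name resolves, the ping times out.
before = dial_up_or_broadband(online=True, dns_entry_exists=True)
# Same store after the stale entry is removed: the name no longer resolves.
after = dial_up_or_broadband(online=True, dns_entry_exists=False)
print(before.value, "->", after.value)
```

Under these assumptions, a broadband store flips from the broadband branch to the dial-up branch the moment the entry disappears, while a store that really is on dial-up is classified the same way before and after.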
Of course, management decided that this was the system admins' fault, for deleting the stale DNS entry. They should've known better!