Woof. Shawn G. couldn't believe his eyes. A support ticket had just come in about a user who was having a problem with a DOS-based computer with a 286 processor!? Fresh out of college, Shawn was more used to working with PCs running Windows XP, with processor speeds and RAM amounts in the multi-"giga" ranges. Much to his surprise, he was getting ready to help a user with the equivalent of 1985's cutting-edge top model. Expecting the real problem to be a bug in their help desk software misreporting the PC's specs, Shawn gave the user a ring.

It turned out that yes indeed, the computer in question was based on the ancient yet venerable 286 processor. But the kicker was that the PC wasn't a PC at all. Instead, it was a very expensive, mission-critical, ruggedized "beige box" that was used by the engineers to calculate the efficiency of heat transfer for some very large and noisy piece of plant machinery. Despite feeling a little bit out of his element, Shawn figured he'd give troubleshooting a shot and asked for a description of the problem.

"Well, for a while, it would hang all the time," stated the plant technician.

"About how often?" countered Shawn.

"Oh, I'd say in the neighborhood of about thirty times a day," replied the tech.

As if to anticipate Shawn's jaw reaching for the floor, the tech quickly added, "But we have an automatic solution to cover that. We've got a watchdog."

Beware of Dog

The technician described the watchdog as a little piece of homemade hardware that was connected to the serial port and lived inside of an external drive enclosure. They had cobbled it together for that once-in-a-while when the computer would get stuck processing a calculation and need manual intervention to restart it. Basically, every five minutes the computer had to "say hello" to the serial watchdog in order to verify that it was still up and running. Otherwise, the computer would be reset.
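For readers who haven't met a watchdog timer before, the idea can be sketched in a few lines of Python. This is only a toy model of the behavior the technician described, not the plant's actual hardware or firmware: the class name `SerialWatchdog`, its methods, and the use of plain numbers for timestamps are all assumptions made for illustration. Only the five-minute window comes from the story.

```python
class SerialWatchdog:
    """Toy model of the external watchdog. All names here are hypothetical."""

    def __init__(self, timeout_s=300.0):
        # Five minutes, per the technician's description.
        self.timeout_s = timeout_s
        self.last_hello = 0.0

    def hello(self, now):
        """The computer checks in; remember when it last did so."""
        self.last_hello = now

    def must_reset(self, now):
        """True once more than timeout_s seconds have passed without a hello."""
        return now - self.last_hello > self.timeout_s


wd = SerialWatchdog()
wd.hello(now=10.0)
on_time = wd.must_reset(now=300.0)   # still within five minutes of the last hello
too_late = wd.must_reset(now=311.0)  # more than five minutes of silence
```

Passing `now` in explicitly, rather than reading a clock, keeps the sketch deterministic. A real device would presumably be driven by a hardware timer instead.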

In practice, this worked out pretty well and was well received since, after a history of sketchy performance, the number of readings per day was way up and the plant technician wasn't trapped babysitting the computer. However, there was another small problem that came up.

"Yeah, the real problem was that it would get stuck in a rebooting loop. Things got so bad that we had to add another program. One the tenth reboot in a row, it would delete its save state from disk before loading the main program."

A light immediately went off in Shawn's head, "Wait a second - What do you mean by 'saved state' exactly?"

IN UR 'PUTER - SAVIN YER STATE

The plant technician explained that, simply put, the addition of a saved state allowed the computer to pick up processing calculations at the same point it left off prior to being rebooted. With calculations taking about fifteen minutes to finish, the feature was a huge time saver. And, like the watchdog device on the serial port, it was an in-house solution.

"About how long would you say that you've been having the rebooting issue?", Shawn asked.

"Oh, I'd say about three months or so by now," was the reply from the plant technician.

"Ok, and how long has the saved state logic been in place?"

"Hmm, just about as long. What, do you think the state saving is the problem?" asked the tech in a slightly peeved tone.

The logic was beautiful: once the computer reached a state that caused it to hang, that state was saved. The computer would then reboot, load the saved state, hang again, and then save the state again. Over and over and over again.
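The loop is easy to reproduce on paper. The following Python sketch is a simulation under stated assumptions, not the plant's real code: the functions `boot_cycle` and `count_hangs`, the string `"bad"` standing in for a state that hangs, and the way consecutive reboots are counted are all hypothetical. It combines the three in-house fixes from the story: the watchdog reset, the saved state, and the "delete the saved state on the tenth reboot in a row" program.

```python
def boot_cycle(saved_state, reboot_count, max_reboots=10):
    """Simulate one boot. Returns (saved_state, reboot_count, hung)."""
    # Boot-time guard: on the tenth reboot in a row, throw away the saved state.
    if reboot_count >= max_reboots:
        saved_state = None
        reboot_count = 0

    # Resume from the saved state if there is one, otherwise start fresh.
    state = saved_state if saved_state is not None else "fresh"

    # Assumption: the state "bad" always hangs the calculation.
    hung = (state == "bad")
    if hung:
        # The bad state is what sits on disk when the watchdog resets the box.
        return state, reboot_count + 1, True
    return state, 0, False


def count_hangs(initial_state):
    """Count how many times the machine hangs before it finally runs clean."""
    saved, count, hangs = initial_state, 0, 0
    while True:
        saved, count, hung = boot_cycle(saved, count)
        if not hung:
            return hangs
        hangs += 1


hangs_with_bad_state = count_hangs("bad")
hangs_with_no_state = count_hangs(None)
```

In this model a single bad saved state produces ten hangs in a row, each one waiting on the watchdog, before the guard finally clears the state. With no saved state at all, the machine never enters the loop.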

Despite all the work that went into the saved state feature, no one had really considered what would happen if the system saved a failed state. Nor did anyone think to roll back to a previous version of the processing code once the computer started rebooting. They just kept pasting on patch after patch.

"Actually," Shawn replied, "to be completely honest, I think it works exactly as designed!"