Back around the turn of the century, government was a different place to work. The public trough, while not as fat as it had been, was still capable of funding boondoggles handed out to friends and family. This was before deficit hawks made a sport of picking off small cost overruns that scurried around the fields of government largesse. Before billions were spent on wars of questionable necessity. Before mayors broke down the stereotype that all crack addicts were skinny.

In this heyday, Ray worked for a government department that contracted, managed, and passed through telecommunications services from external providers to other government departments. The department's central billing and administration system was built and run on the Ingres ABF framework, and its origins dated back to the early '90s. What's more, as soon as the application could be put into minimal funding status, it was. Even in the heady Internet-bubble days, no money was spent beyond what was needed to keep the application running.

For developers, this meant a heavy reliance on shell scripts and other such tools to support the main application. And, considering the critical nature of the application (it did generate revenue... or at least caused numbers to be moved from one ledger to another within the government), any change went through enough manual testing to defoliate an acre of the Amazon rain forest generating the testing outputs.

So when Ray needed to make a bulk data change to the central database, he followed the prescribed steps. The appropriate shell script was created, followed by multiple runs on the test server to produce the three typeset, calf-leather-bound volumes of input/output testing printouts. Once done, five levels of sign-off were collected. While there's no question that this was an extreme process (XP, but not in the productive way), by the time Ray had run the procedural gauntlet, he was confident that the script would do what it was advertised to do.

To run these scripts, the developers used one-off at jobs to schedule them to start after hours on the server in question. This mechanism, along with a solid SMS notification system for failed at jobs, meant that developers could schedule a script to run and then go home with confidence.
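For the unfamiliar, the mechanics looked roughly like this (a sketch only; the script path, log file, and parameters are placeholders, not the department's actual setup):

    # One-off "run it tonight" scheduling with at -- unlike cron, the job fires once and is gone.
    # Paths and parameters below are illustrative.
    echo "/home/ray/scripts/rous_at_job.sh param1 param2 >> /home/ray/logs/rous.log 2>&1" | at 18:30

If the job failed, the notification system would let you know; silence meant everything had gone fine. In theory.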

Ray set his job up to run at 6:30pm, and with no notification of a failure, it was a sleep-filled evening. He came in the next morning confident it would be a normal day. The sight of the wide-eyed, slightly perspiring system administrator, combined with his opening statement of "Thank god you're here!", extinguished that.

"Fezzik's down!", he said. The servers were named after movie characters and Fezzik was the production server that Ray had scheduled the script on the night before.

"Um...define 'down'." Ray said, stalling while desperately trying to think of what weird permutation in the script could have caused this.

"It's not responding. The network controller says Fezzik's there. We can ping it. But terminal sessions are immediately frozen on connect and the applications running on that server are unreachable."

"So, it's not DOWN down then?" Ray asked as he reversed course and headed to the server room.

"It's down enough", came the reply.

At the server console, the user login shell was visible. The sys admin pushed a key. The server replied with an annoyingly cheerful beep. One key press, no characters, just a beep. The keyboard buffer was full. Ray felt queasy.

"Inconceivable. I have no idea what caused that." Ray said with an honesty that was quickly turning to desperation.

"Well", said the admin, "we did get some e-mails from the system this morning before it stopped responding. What the hell is rous_at_job.sh?"

Ray paused. "Why?"

"There's so many instances of it that we don't KNOW how many instances there are of it!"

Realization and dread dawned on Ray in equal measure. Instead of setting rous_at_job.sh to run rous.sh param1 param2, Ray had set it to run rous_at_job.sh param1 param2! The script simply invoked itself, recursively, forever. So, for a little over twelve hours, like Agent Smith in The Matrix, rous_at_job.sh had patiently, one KB at a time, taken over the memory and run-time capabilities of the server. By the time the system administrators got in that morning, rous_at_job.sh had successfully completed its quest for electronic domination and had physically run out of space to spawn another process.
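In other words, the wrapper looked something like this (reconstructed from the story; only the names rous_at_job.sh and rous.sh come from the original, and the path and parameters are guesswork):

    #!/bin/sh
    # rous_at_job.sh -- thin wrapper scheduled via at to kick off the bulk data change.
    #
    # What Ray meant to write:
    #   /path/to/rous.sh param1 param2
    #
    # What he actually wrote -- the wrapper launches a fresh copy of itself on every run:
    /path/to/rous_at_job.sh param1 param2

Each invocation spawns a child that does exactly the same thing, so the process table and memory quietly fill up until the machine can't so much as echo a keystroke.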

The only option was to literally unplug the machine. The only saving grace was the fact that, given the current state of the processes, Ray was pretty certain that the server wasn't actually doing anything. Other than running rous_at_job.sh, that is.

The server came back no worse for wear. Going forward, developers were banned from running ANY job on the production server. Like magic, budget was found for a new data change management and scheduling system. And Ray spent a large percentage of his paycheck at the pub that Friday buying the system administrators beers.
