Bueller?

Jerry wasn't the sort of guy who would normally vent frustration out loud at work, yet here he was, cursing into the air at two individuals in particular. The first round of expletives was directed at the toolbag, somewhere, who had botched months of server backups by reusing the same set of tapes; the second at a long-departed developer whose name kept turning up in the comments of the rotten shell script he was now stepping through.

What had started out as a 7:30am ticket from an early-bird user getting an error message when trying to open a spreadsheet test plan from the week before had turned into a full-on, corporate-wide DEFCON 1.

To make matters worse, Jerry had delivered his two-week notice just a few days prior, which meant that in every meeting he was getting "thanked" for the company's current nuclear crisis and told that he should have set his little "time bomb" to go off AFTER he was gone. Naturally, while blaming Jerry helped improve everyone else's morale, it didn't do much for Jerry's outlook - especially since it appeared that this was someone else's "parting gift".

Questions? Please Refer to the Scriptonomicon

For as long as anyone could remember, everyone had just kind of coped with the Bourne shell script that served as the framework for a test environment. It was originally designed to run automated tests for a single product, but management was so thrilled at how well it worked that they got other projects to adapt the framework.

Over the next few years, it became the de facto test framework used by applications throughout the corporation. However, in order to make "one size fit all", it had morphed into something... different. It became one of those gnarly applications that everybody acknowledged was a bit sketchy behind the scenes, but it worked. So long as you stuck to the S.O.P., knew the different locations where the same value had to be defined, and accepted that P_OPERATOR_ID was a unique network identifier (NOT a normal network ID) that you had to get from Chuck in the Infrastructure Group, you'd be OK.

Recently, however, the developer who had originally created the framework had left the company in search of greener pastures and, rather than being handed off to another developer, the task of running the scripts was given to a co-op student. After all, running the script was like checking off steps on a list, right? The co-op set up the configuration, scheduled it to run over the weekend, and merrily left, planning to return the following week. As it turned out, he missed a few details.

Cleaning Up

From a high level, the Bourne script would essentially ssh into each target machine, do its thing, and then exit. As part of its "thing", the designer of the framework wanted to make sure the script cleaned up after itself so subsequent runs of the framework would not re-process old data. To accomplish this, one of the enhancements after the initial release was to add two cryptic variables that (redundantly) contained the project name and the version being tested. Utilizing an unpatched flaw in sudo's setup to gain real root access, the script would then do the following as part of the clean up:

rm -rf $var1/$var2

Ordinarily, this worked just fine, but the co-op student was unaware these SPECIFIC variables needed to be set. With them being left blank, the following was the end result upon execution of the script:

rm -rf /

With the script running as root on a setup with NFS (which, in turn, granted access to everything on the entire UNIX/Linux network and a few Windows Servers via SAMBA), the script had a chance to do a good bit of damage... and it did. Home directories, file repositories, customer data, test results, all seemingly evaporated into nothingness.
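
For illustration only, here is a minimal sketch of the kind of guard that would have refused to run the cleanup with blank variables. The names var1 and var2 come from the story; the function names, messages, and checks are hypothetical and are not taken from the actual framework. It only rejects empty components and paths made up entirely of slashes, so treat it as a starting point rather than a complete defense.

```sh
#!/bin/sh
# Hypothetical guard for the cleanup step. Only var1/var2 (project, version)
# come from the story; everything else here is invented for illustration.

# Print the cleanup target on stdout, or return 1 if it looks unsafe.
cleanup_path() {
    project=$1
    version=$2

    # An empty project plus "/" plus an empty version is exactly "/".
    if [ -z "$project" ] || [ -z "$version" ]; then
        echo "refusing cleanup: project or version is empty" >&2
        return 1
    fi

    target="$project/$version"

    # A path with no character other than "/" still means the root directory.
    case "$target" in
        *[!/]*) ;;
        *) echo "refusing cleanup: target is the root directory" >&2
           return 1 ;;
    esac

    printf '%s\n' "$target"
}

# Remove the target only if cleanup_path accepted it.
safe_cleanup() {
    target=$(cleanup_path "$1" "$2") || return 1
    # Quoting keeps the path as one argument; "--" ends option parsing.
    rm -rf -- "$target"
}
```

A call such as safe_cleanup "$var1" "$var2" returns a non-zero status instead of deleting anything when either variable is blank. A shorter alternative with a similar effect is the standard parameter expansion "${var1:?}/${var2:?}", which makes a non-interactive shell abort with an error when a variable is unset or empty.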

All told, it took 6 hours to wipe out the entire network. It took 4 hours to figure out what happened (turns out the script ssh'd onto its own server and the rm -rf wiped out the scripts which did the rm -rf, along with most of the evidence of what happened) and it only took 10 seconds to realize that the latest backups were completely SNAFU'd.

So, as his parting gift, while the most critical drives were being sent off for possible forensic recovery, Jerry was asked to review the test framework and look for any other flaws that could let something similar recur. After hitting the 10th instance where deviating from the normal routine would result in some degree of disaster, Jerry knew one thing - even though he had less than two weeks to go, this was one script that would be haunting his nightmares for a long time to come.
