Maintenance programming is an important job in nearly any software shop. As developers move on to other positions or companies, the projects they leave behind still need someone to take care of them, to fix bugs or implement customer requests. Often, these products have been cobbled together by a variety of programmers who no longer work for the company, many of whom had only a loose understanding of the product (or even programming in general).
Martin was one such maintenance programmer. After being hired, management quickly realized he had a knack for digging into old systems and figuring them out well enough to update them, which often meant a full rewrite to make everything consistent and sane.
One such system that quickly fell into his lap was essentially a management appliance, a Linux virtual machine (VM) prepackaged with a web-based management interface to control the system. The web application worked well enough but the test suite had…trouble.
The tests used Selenium to deploy a fresh VM and perform some pre-defined actions on the user interface. Most of the test suite was written by a former employee named Jackson who, as far as Martin could tell from his notes and source control commit messages, had very odd assumptions about how things worked, especially involving concurrency.
The test suite had some serious performance issues, as well as a ton of inexplicably failing test cases. The system did not scale up as more VMs were deployed, at all, and Martin uncovered the scary truth that Jackson had wrapped everything in synchronization primitives and routed all actions through a global singleton which stored all state for all VMs. Only one test operation at a time was supported, across all test VMs, forcing them to queue up and run sequentially.
Seeing how all test state was stored in a global singleton, Martin realized that a huge number of the test suite’s failures had to do with leaky state. One test VM would set some state, then give up its lock between tests, providing a small window for another VM to grab the lock and then fail because the state wasn’t valid for that specific test.
He asked around the office to see if anyone knew more about the test system, and though nobody knew the specifics, his coworkers did recall that Jackson was hugely concerned that state would leak between test VMs and cause problems and had spent most of his time designing the system to avoid that. So Martin started reviewing source control history and commit messages, and found that Jackson was ignorant of anything beyond basic programming. Somehow, he believed the singleton would prevent state from being shared. Commit messages spelled it out: “Used a singleton to avoid shared state for concurrency.”
And so Martin spent a few months improving the system by removing the singleton and mutexes, and generally cleaning up the tests’ code. During testing, Jackson’s shared state woes never surfaced, and when Martin was finished the test suite scaled very well by the number of VMs. Most of the spurious test failures simply disappeared and the entire suite ran in a fraction of the time.
And now Martin understood why Jackson was no longer with the company. His solution for dealing with concurrency problems from “potential” shared state was to rewrite the framework to use “assuredly” shared state.