Joe G. was working for a financial company that had accumulated more than 20 years' worth of code and cruft. This was compounded by management being convinced that the source of all of their technical problems was IT not caring enough about business interests, rather than two decades of short-term thinking. They refused to acknowledge the company's technical debt and believed that IT employees' attitudes were solely responsible for their growing reliability problems.

The code was mostly C++, with some Java mixed in here and there for a few legacy front-end applications. Joe was tasked with debugging a problem with one such application called TurdEditor. TurdEditor would receive several parameters from a server at startup, then let users tweak those parameters and click an apply button that ostensibly sent the new parameters back to the server. There was also an environment variable called DOMAIN that, if set, would cause all company applications to connect to the test servers instead of those in production.
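
As a rough illustration of the DOMAIN convention described above, a client could choose its server along these lines. This is only a sketch: the host names and the function name are made up for the example, and nothing here is taken from the firm's actual code.

```python
import os

# Hypothetical host names; the story never says what the real servers were called.
PRODUCTION_HOST = "turd-prod.example.com"
TEST_HOST = "turd-test.example.com"


def pick_server(environ=None):
    """Return the test host when DOMAIN is set to a non-empty value,
    otherwise fall back to production."""
    env = os.environ if environ is None else environ
    return TEST_HOST if env.get("DOMAIN") else PRODUCTION_HOST
```

Note that the check only works if every process in the chain actually sees the variable, which is exactly the assumption that breaks later in the story.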

Since Joe wanted to be a good little developer, he set DOMAIN=TEST so that TurdEditor would only connect to the test server while he attempted to reproduce the bug. At this point he noticed that whenever he tried to apply the changes, they didn't take effect. The logs on the test server showed no record of anything being sent to the server. On a hunch, he checked the production server logs and sure enough, there was the record of his settings being applied. Oops! Now Joe had another problem to find and fix before he could get back to his primary mission.

Although not a Java programmer, Joe dove head first into the code and found a very simple callback for the apply button. All it did was execute a shell script called con_serv. Unfortunately, that script ssh'd to a hard-coded host called 'columbia' and executed something called hump. Although all of the important binaries and scripts were in an NFS folder mounted across all hosts, hump was nowhere to be found. Unsure of what else to do, he ssh'd to columbia, and sure enough, there was hump! There was one catch: hump was in the common NFS folder, but it could only be seen if you were on columbia.

A quick trip to the mount showed that different machines mounted different NFS shares to the same folder. It turns out that the firm had transitioned from SPARC to x86 a number of years earlier, and there were many random scripts and binaries throughout the infrastructure that made hard coded assumptions about where utilities were located. So SPARC boxes used one NFS share that had SPARC binaries in the common folder, and x86 boxes mounted a different NFS share with x86 binaries inside to the same path. And in typical style for the firm, it had never fully finished the transition, and columbia was one of the dozen remaining SPARC hosts.

Now Joe just needed to find the source code for hump. Of course, it was not in source control; the binary was so old that people had lost track of it. While upon the mount, Joe approached the guru for his sage advice. The guru had an old copy of the source lying around in his home folder. Joe hoped it would resemble the current binary closely enough to be insightful. What he found was that he had descended into a quagmire that was nothing short of a Pit-o-WTF™.

In about 3000 lines of C code, it managed to connect to the hard-coded production database, query it for all of the hosts that had PIDs currently accessing the turd_table, and run ANOTHER shell script that would blindly ssh to the first host returned (which, luckily, in the common case was the production server) and do a "kill -HUP" on the associated PID. The HUP signal would in turn tell the server to reload the database. It turned out that TurdEditor was modifying the database directly and then going through this Rube Goldberg mechanism to tell the server to reload it. But in all of the ssh'ing, DOMAIN=TEST was lost, and even if it hadn't been, hump didn't know to respect it anyway. So in the end the full path was:

  1. User clicks apply
  2. Changes are committed to database
  3. Hard-coded SPARC host is ssh'd to
  4. SPARC binary connects to hard-coded database to find hosts looking at turd_table
  5. SPARC binary launches shell script
  6. Shell script blindly ssh's to first result and issues HUP
  7. Server reloads the parameters from the database
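
One reason DOMAIN=TEST vanished along the way is that ssh does not carry the caller's environment variables to the remote command unless both ends are configured to pass them. A minimal sketch of how a wrapper like con_serv could have handed the variable over explicitly is shown below. The function name and its parameters are hypothetical, and as the story notes, hump would still have had to be taught to honor the variable.

```python
import shlex


def build_ssh_command(host, remote_command, env, forward=("DOMAIN",)):
    """Build an ssh argument list that re-creates selected environment
    variables on the remote side by prefixing the command with `env`."""
    # Only forward variables that are actually set locally.
    assignments = [
        f"{name}={shlex.quote(env[name])}" for name in forward if name in env
    ]
    if assignments:
        remote = " ".join(["env", *assignments, remote_command])
    else:
        remote = remote_command
    return ["ssh", host, remote]
```

With DOMAIN set locally, `build_ssh_command("columbia", "hump", {"DOMAIN": "TEST"})` produces an ssh invocation that runs `env DOMAIN=TEST hump` on columbia. Without it, the remote command is simply `hump`, which is effectively what every hop in the chain above was doing.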

So, all together, clicking apply involved:

  1. Two database connections (commit change, query for PIDs)
  2. Hops across four hosts (desktop, database, columbia, turd server) with two CPU architectures (x86, SPARC)
  3. Three custom scripts/utilities (ssh columbia wrapper, hump, ssh turd server wrapper)

Perhaps they also need to implement a system-wide flush.

 

