Most technical folks can recognize a developmestuction environment when they encounter one. The less fortunate among us have had one inflicted upon us. However, the one thing they all seem to have in common is that people simply make changes directly in production. I’ve encountered a place that takes the concept to a Whole New Level O’ WTF™.
The company is a huge international conglomerate with regional offices on five continents, spread fairly evenly around the globe. The team for this particular project has several folks (developers, testers, QA, UAT and prod support) in each location. Each region is mostly a self-contained installation of servers, databases and end users, but just to make it interesting, some of the data and messaging is shared across regions. Each region runs normal business hours in its own time zone, so at any given time, one region is always doing intra-day processing, one is always in night-time quiet mode, and the other three are in various stages of ramp-up, ramp-down or light traffic.
The application is a monstrously large suite of Java applications. Since each region has its own Java support contract, different regions run the same code base on different versions of Java. Accordingly, the automated build process builds the entire tree in each of several different versions of Java. This brings up all sorts of region-specific problems when there’s an issue with a particular release of a JVM. Reproducing them back at development-central inevitably fails because we’re only allowed to install the version of the JVM for which we have a support contract.
For those who aren’t familiar with it, the default Java class loader traverses the class path, checking each directory or jar file it finds there for a given class. It takes the first instance of the class that it finds, which means that if two or more different versions of a class are on the path, only the first one will ever get loaded: the Java version of DLL Hell.
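If you ever need to see which copy of a class actually won that race, the class loader will tell you where it found the bytes. Here is a minimal sketch; the class name and the example paths in the comments are made up for illustration, not taken from the actual system:

```java
// Ask the class loader which directory or jar a class was actually loaded from.
public class WhoLoadedMe {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> c = Class.forName("com.example.FooService"); // hypothetical class name
        String resource = c.getName().replace('.', '/') + ".class";
        // Prints something like file:/opt/project/p-dev/com/example/FooService.class
        // or jar:file:/opt/project/jars/release.jar!/com/example/FooService.class
        System.out.println(c.getClassLoader().getResource(resource));
        // With -cp p-dev:jars/release.jar, a copy under p-dev/ shadows the one
        // inside release.jar; only the first match on the path is ever loaded.
    }
}
```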
Someone got the idea that, to save money, we could leverage (read: abuse) the way the Java class loader works and have only one set of machines in each region provide the dev, integration, QA, UAT and production environments. The basic premise is that you could introduce short-term-use classes earlier in the class path to ‘patch’ (override) classes that were already there.
The class path search hierarchy was in this order:
- $Project/p-dev: individual developer class patches for testing
- $Project/p-int: team class patches for integration testing
- $Project/p-qa: QA class patches for interim testing between releases to QA
- $Project/p-uat: user acceptance patches between releases to UAT
- $Project/p-prod: patches to prod code between formal releases
- $Project/jars: formal releases
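Something along these lines is presumably how the launch script stitched those directories into a single search order. The sketch below is my own reconstruction, not the site’s actual script; the PROJECT environment variable and the /opt/project default are assumptions:

```java
import java.io.File;
import java.util.List;

// Builds the class path in the same order as the patch hierarchy above.
public class ClasspathOrder {
    public static void main(String[] args) {
        // $Project root; the environment variable name is an assumption
        String project = System.getenv().getOrDefault("PROJECT", "/opt/project");
        List<String> entries = List.of(
            project + "/p-dev",   // developer patches shadow everything below
            project + "/p-int",
            project + "/p-qa",
            project + "/p-uat",
            project + "/p-prod",
            project + "/jars/*"   // wildcard pulls in every formal release jar
        );
        // Whatever appears earliest here wins; later copies of the same class
        // are never loaded.
        System.out.println("-cp " + String.join(File.pathSeparator, entries));
    }
}
```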
A patch consists of one or more class and/or configuration files (perhaps hundreds) that collectively fix/change/add/delete functionality. The rules for patching were as follows:
- compile your class changes on your PC
- grab all the ‘patch’ files into a dedicated directory tree that mimics the class hierarchy
- install your patch tree only on a region currently in night-mode
- announce the (potentially large list of) classes you’re patching to all other locations
- manually ensure that nobody else is patching the same files (a sketch of automating this check appears after the list)
- bounce the target production environment
- do the testing or debugging
- remove your patch files from the deployed tree
- undo all the database changes you made
- bounce the target production environment - again
- announce that you’re done
The underlying principle is that developer changes are very short-term (install, test, remove); changes to the other patch directories last until the next release at that level.
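The “make sure nobody else is patching the same files” step was entirely manual, which is exactly the kind of check that is easy to automate. Here is a rough sketch of what that might have looked like, assuming the directory layout above; nothing like this actually existed there, and the class/argument names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Before installing a patch, walk the patch directories and report any
// copies of the same class that are already deployed somewhere in the chain.
public class PatchConflictCheck {
    private static final String[] PATCH_DIRS = {
        "p-dev", "p-int", "p-qa", "p-uat", "p-prod"
    };

    public static void main(String[] args) throws IOException {
        Path project = Paths.get(args[0]);                        // e.g. /opt/project
        String classFile = args[1].replace('.', '/') + ".class";  // e.g. com.wtf.FooService

        for (String dir : PATCH_DIRS) {
            Path root = project.resolve(dir);
            if (!Files.isDirectory(root)) continue;
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(p -> p.toString().endsWith(classFile))
                    .forEach(p -> System.out.println("Already patched in: " + p));
            }
        }
    }
}
```

Run with the project root and a fully qualified class name before copying a patch tree in; any output means someone else already has a competing copy on the path.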
Insane though it may have been, it sort of worked… for small, localized changes. For changes that touched lots of files across many directories, folks routinely missed some files and needed to patch their patches.
Longer term patches for QA and UAT tended to stick around for a while and so were effectively already in production even though they weren’t formally released. This led to all sorts of interesting situations where users would notice some feature (that was only released to QA) and start using it, only to have it disappear when QA pulled it due to some bug.
However, changes to database structures, application internals or regional interactions tended to ripple live into other regions. As such, the end users were routinely told to expect event handling, query results or mathematical formulas to change - in production - on an intermittent basis. Sometimes they were told that certain bits of functionality would be unavailable because someone was testing something.
Of course, if more than one person needed to test something, they needed to manually coordinate the patching/bouncing/un-patching/re-bouncing with the other folks.
Occasionally, two folks would need to change different methods of the same class. Since both were testing prior to check-in, neither could see the changes made by the other. This led to a lot of cut-paste-emailing of code in one direction or the other, with subsequent manual undo (a source control revert would wipe out both sets of changes for whoever was doing the patching). Naturally, there was the inevitable missed undo that resulted in all sorts of confusion.
One side effect was that if you were changing communication code, the test messages you sent went out on the production network and, depending on the nature of the message, might be processed in other (un-patched) regions.
An even better side effect was that if you attached a debugger to a process running patched code and hit a breakpoint, you halted a production process, globally affecting everything that depended on it. Naturally, the team was advised to avoid using debuggers if at all possible, and if they did stop on a breakpoint, to resume execution as quickly as practical.
The best part was when I noticed test transaction records that were still in the production database and being included in regulatory reports (yeah, that one hit the fan - hard).
When I inquired as to how this could be allowed to happen, let alone flourish, I was told that they didn’t want to make waves by insisting that the company buy enough servers to set up different environments.
Customers will only pay to fix a problem when it becomes more expensive to not fix it than to fix it.
I learned my lesson about trying to fix stupid at WTF Inc…