One of J.W.'s clients called him in to help diagnose and fix some reliability problems with the deployment system. The client was a small shop of about ten developers, one dual-role QA/manager, one SA who controlled the QA and production machines, and the requisite bean counters.

Upon arrival, J.W. was proudly shown the home-grown system that the manager had cobbled together. For each release, the developers would fill out a wiki page template. The template had sections for:

  • Configurations
  • Database
  • Network
  • Hardware
  • System Software (e.g., web servers)
  • Application
  • Miscellaneous
  • List of approvers for each section, as well as the whole release

Each relevant section for a given release had a descriptive paragraph containing the purpose and scope of the change, and a list of the affected files. It also had a section where the developers would put the exact (*nix) commands that had to be used to accomplish the task.

For such a small shop, the procedures seemed fairly solid. Everything had to be built and pass assorted tests and code reviews in development. The SA would perform the exact deployment steps on the QA machines. The manager would put on his QA hat and test everything. The SA would then repeat the deployment process on the production machines, using the exact same commands that had been tested in both development and QA. Finally, the manager would test everything in production.

So what sort of problems were occurring? It turned out that no matter how meticulously the developers had scripted the deployment tasks, things always seemed to go wrong, randomly, in deployments to both QA and production. Even things that worked perfectly while deploying to QA would break during deployments to production, and never twice in the same way or place.

After watching them go through the motions of a release, J.W. concluded that the developers were following the procedures correctly and according to plan. The wiki page for each release looked fine to the naked eye.

While the deployment wiki page was loading, J.W. happened to notice image alt-text appearing briefly before the script text appeared. But why would there be alt-text where plain text should be?

A quick view-page-source turned up the culprit. The manager wanted to make sure that nobody changed the instructions after they were posted, so he had the developers write up the script text, verify that it worked, then take a screenshot of it and put the image of the script on the wiki page.

The poor SA, somewhat afraid of making a typo while transcribing potentially long sequences of complex commands and data, sought to mitigate the risk by automating the task. How was this feat accomplished? With the manager's approval, the SA would run OCR on the images of the scripts and data, then execute the OCR's output.

Of course, the OCR would frequently misread something, and hilarity would ensue.

J.W. explained the folly of all of this to the manager and suggested that simply putting the text of the commands themselves on the wiki would solve the problem. Unfortunately, the manager insisted that the scripts needed to be protected from unauthorized changes, and that using images of them was the only way to do that...

  JW:  OCR is not reliable enough to do this; that's why you're having all the problems
  SA:  Agreed, but manually typing in large quantities of data and commands is far less reliable
  Mgr: Then the solution is to check the output of the OCR before using it
  JW:  And how do you propose to do that?
  Mgr: Simple: after the OCR is performed, you will run a script to compare it to the scripts in source control
  JW:  Why not just use the scripts directly from source control instead of the OCR?
  Mgr: Because the OCR software was a significant expense and is part of the procedure!

And so it went. The SA would perform OCR on the pictures of the scripts, then diff the OCR output against the scripts in source control, correcting every difference to match the source, and repeating until the two were identical. Finally, the SA would run the generated file to perform the requisite tasks.
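The procedure the SA ended up with amounts to a fixed-point loop whose answer was in source control all along. A minimal sketch, with every filename hypothetical and a `sed` substitution standing in for the real OCR software's typical misread of "l" as "1":

```shell
#!/bin/sh
# Sketch of the procedure; 'trusted_deploy.sh' plays the script kept in
# source control, and the sed step fakes the OCR pass over the screenshot.

TRUSTED=trusted_deploy.sh        # the script as kept in source control
OCR=ocr_output.txt               # what the OCR produced from the image

printf 'echo deploy complete\n' > "$TRUSTED"
sed 's/l/1/g' "$TRUSTED" > "$OCR"   # simulate an OCR misread: 'l' -> '1'

# The SA's loop: diff against source control, fix, repeat until clean.
until diff -q "$TRUSTED" "$OCR" >/dev/null; do
    # In practice each hunk was corrected by hand; since the fixed point
    # is always the trusted script, copying it shows where the loop ends.
    cp "$TRUSTED" "$OCR"
done

sh "$OCR"                        # finally, run the reconstructed script
```

The loop can only ever terminate when the OCR output is byte-identical to the script in source control, which is exactly why running the source-controlled script directly would have done the same job without the OCR step.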

And the manager was proud of the system he had created.