When the big banks and brokerages on Wall Street first got the idea that UNIX systems could replace mainframes, one of them decided to take the plunge - Big Bang style. They had hundreds of programmers cranking out as much of the mainframe functionality as they could. Copy-paste was all the rage; anything to save time. It could be fixed later.
Senior management decreed that the plan was to get all the software as ready as it could be by the deadline, then turn off and remove the mainframe terminals on Friday night, swap in the pre-configured UNIX boxes over the weekend, and turn it all on for Monday morning. Everyone was to be there 24 hours a day from Friday forward, for as long as it took. Air mattresses, munchies, etc. were brought in for when people would inevitably need to crash.
While the first few hours were rough, the plan worked. Come Monday, all hands were in place on the production floor, and anything that didn't work set off a flurry of activity to get it fixed in very short order. All bureaucracy was abandoned: everyone had root and did whatever it took on the fly, no approvals required. Business was conducted. There was a huge sigh of relief.
Then came the inevitable onslaught of requests to add this and that for all the features that couldn't be implemented before the hard cutoff. This went on for 3-4 years until the software was relatively complete, but in desperate need of a full rewrite. The tech people reminded management of their earlier warning about all the shortcuts taken to save time up front, and said it was time to pay the bill.
To their credit, management gave them the time and money to do it. Unfortunately, copy-paste was still ingrained in the culture, so nine different trading systems each had about 90% of their code identical to their peers', all in separate repositories, each with slightly different modification histories to the core code.
It was about this time that I joined one of the teams. The first thing they had me do was learn how to verify that all 87 (yes, eighty-seven) of the nightly batch jobs had completed correctly. To get through this task, both the team manager and the lead dev worked non-stop from 6AM to 10AM - every single day - to verify the results of the nightly jobs. I made a list of all the jobs to check and what to verify for each one. It took me from 6AM to 3PM, which was kind of pointless given that the markets close at 4PM.
After doing it for one day, I said no way and asked them to continue doing it so as to give me time to automate it. They graciously agreed.
It took a while, but I wound up with a rude-n-crude 5K LOC ksh script that reduced the task to checking a text file for a list of OK/NG statuses. But this still didn't help if something had failed, so I kept scripting more sub-checks for each task to implement what to do on failure: look up which document had the name of the fix-it job to run, figure out what arguments to pass, run it, check its status, notify someone on the upstream system if it still failed, and so on. Either way, the result was recorded.
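I don't have the original script anymore, but each per-job check boiled down to something along these lines (a minimal ksh sketch; the job names, paths, log format, and the notify_upstream helper are all hypothetical):

    #!/bin/ksh
    # Hypothetical sketch of one per-job check; the real script had 87 of these.
    # STATUS_FILE collects one OK/NG line per batch job.
    STATUS_FILE=/var/tmp/batch_status.$(date +%Y%m%d)

    notify_upstream() {
        # Placeholder: in reality this notified the team that owned the upstream feed.
        print "NOTIFY upstream about failed job: $1"
    }

    check_job() {
        job=$1        # batch job name
        logfile=$2    # where that job writes its completion marker
        fixit=$3      # documented fix-it command to re-run the job

        if grep -q "COMPLETED OK" "$logfile" 2>/dev/null; then
            print "$job OK" >> "$STATUS_FILE"
            return
        fi

        # Failure path: run the documented fix-it job, then re-check.
        $fixit
        if grep -q "COMPLETED OK" "$logfile" 2>/dev/null; then
            print "$job OK (after rerun)" >> "$STATUS_FILE"
        else
            print "$job NG" >> "$STATUS_FILE"
            notify_upstream "$job"
        fi
    }

    check_job positions_eod /logs/positions_eod.log /jobs/rerun_positions_eod.ksh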
In the end, the ksh script had grown to more than 15K LOC, but it reduced the entire 8+ hour task to checking a 20-digit (bit-mask) page once a day. Some jobs failed every day for known reasons, but that was OK: as long as the bit-mask on the page matched the expected value, you could ignore it. You only had to get involved if an automated repair had been attempted and failed, which happened about once every six months.
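The pager encoding worked roughly like this (a sketch under my own assumptions; the real groupings, naming convention, and expected value are long gone). Each digit summarized one group of jobs, and whoever read the page just compared the string against the expected value for the day, which already accounted for the jobs that failed every day:

    #!/bin/ksh
    # Hypothetical: collapse the OK/NG status file into a fixed-width digit mask.
    # Each position is 1 if every job in that group ended OK, 0 otherwise.
    # Assumes job names are prefixed with their group name, e.g. positions_eod.
    STATUS_FILE=/var/tmp/batch_status.$(date +%Y%m%d)
    JOB_GROUPS="settlement pricing positions risk"   # the real list had 20 groups

    mask=""
    for group in $JOB_GROUPS; do
        if grep "^${group}_" "$STATUS_FILE" | grep -q " NG"; then
            mask="${mask}0"
        else
            mask="${mask}1"
        fi
    done

    # Stand-in for the paging gateway: the on-call person compares the mask to
    # the known-good value; anything else means a repair was attempted and failed.
    print "PAGE: BATCH STATUS $mask"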
In retrospect, there were better ways to write that shell script, but it worked. Not only did all that nightly batch job validation and repair logic get encoded in the script (with lots of documentation of the what/how/why variety), but having rid ourselves of the need to deal with this daily mess freed up one man-day per day, and more importantly, allowed my boss to sleep later.
One day, my boss was bragging to the managers of the other trading systems (that were 90% copy-pasted) that he no longer had to deal with this issue. Since they were still dealing with the daily batch-check, they wanted my script. Helping peer teams was considered a Good Thing™, so we gave them the script and showed them how it worked, along with a detailed list of things to change so that it would work with the specifics of their individual systems.
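The "things to change" list was essentially the handful of site-specific settings at the top of the script; something like the following (names are hypothetical), and every one of them needed editing, not just the job paths:

    #!/bin/ksh
    # Hypothetical per-team settings block from the top of the script.
    TEAM=equity_derivs                          # which trading system this copy monitors
    JOB_LIST=/usr/local/etc/${TEAM}_jobs.lst    # only this team's batch jobs
    DATA_ROOT=/data/${TEAM}                     # program and data file paths
    PAGER_GROUP=oncall_${TEAM}                  # who gets the morning status page
    # ...plus deleting every check_job section that belongs to someone else's system.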
About a week later, the support people on my team (including my boss) started getting nine different status pages in the morning - within seconds of each other - all with different status codes.
It turns out the other teams had only modified the program and data file paths for the batch jobs relevant to their own systems; they hadn't bothered to delete the sections for the batch jobs they didn't need, and they hadn't updated the notification pager list with their own teams' contacts. Not only did we get the pages for all of them, but this happened on the one day in six months when something in our system really broke and required manual intervention. Unfortunately, every copy of the script attempted to auto-correct our failed job. Without. Any. Synchronization. By the time we cleared up the confusion of the multiple pages, figured out the status of our own system, realized something required manual fixing, and started cleaning up the mess created by the multiple parallel repair attempts, there wasn't enough time to get the job running before the start of business. The financial users were not amused that they couldn't conduct business for several hours.
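In hindsight, even a crude lock around the repair step would have prevented the parallel repair attempts. A minimal ksh sketch, assuming all nine systems could see a common filesystem on which mkdir behaves atomically (paths and job names are hypothetical):

    #!/bin/ksh
    # Hypothetical: serialize auto-repair attempts with an atomic mkdir lock.
    LOCKDIR=/shared/locks/repair_positions_eod.lock

    if mkdir "$LOCKDIR" 2>/dev/null; then
        # This copy won the race: run the documented fix-it job, then release.
        /jobs/rerun_positions_eod.ksh
        rmdir "$LOCKDIR"
    else
        # Another copy of the script is already repairing this job; record it and move on.
        print "repair already in progress for positions_eod, skipping" >> /var/tmp/batch_repair.log
    fi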
Once everyone changed the notification lists and deleted all the sections that didn't apply to their specific systems, the problems ceased and those batch-check scripts ran daily until the systems they monitored were finally retired.