Pratik K. left work very late on Friday with plans for an awesome weekend. He had earned it. Each application that was part of the new release had been tested in the test database, both individually and as part of an end-to-end drill that included both vertical and horizontal functional and stress testing. His code was bulletproof. All the signoffs had been acquired. The DBAs were ready. The SAs were ready. He could heave a sigh of relief, let go and have some fun. And so on Saturday night, out he went. Drink he did. Late he came home. He got the call at 5 AM Sunday morning. Something was horribly, horribly wrong.
The business users had kicked off what they thought would be a 15-minute job early Saturday afternoon. When they came back several hours later, all the applications were hung and the users couldn't access the database. After everything had been so thoroughly tested, how could such a catastrophic failure have happened? What had gone wrong? The customers were already expecting the release to be installed for their use by business-open on Monday morning. This absolutely had to be fixed now!
Pratik tried to shake off the effects of lots of Tequila and related refreshments, and logged in. The network was fine, but he couldn't access the database servers; they simply didn't respond. He got the SAs and DBAs out of bed and had them try to get on the various servers.
The reports came back: CPU usage, memory, swap file usage and paging were at or near 100% on all of the database servers. After awakening assorted layers of managers to get special permission, he attached a debugger to one of the applications and discovered that virtually every thread was blocking on a database call. After having all the processes killed, it was time to walk the code.
After several hours of trying to focus through the Tequila and make sense of code on no sleep, Pratik found the problem. The application suite had been configured with 16 of his low-level crunching applications, each called as a service to perform various tasks. Each of these would put a list of messages on a queue to the ESB, which would do some more analysis and send individual messages to the end-user application. Pratik's servers were grinding away, throwing all sorts of data at the ESB, which, in turn, threw all sorts of data at the end-user application. The end-user application had been configured with 500 threads so it could process as much information in parallel as possible.
Of course, if 500 threads try to kick off 500 sets of queries - in parallel - the database server is going to grind to a halt. And, oh by the way, one of their queries was awakening the evil of a known Oracle bug, adding more fuel to the fire.
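In hindsight, the fix is the classic one: cap the number of in-flight queries at whatever the database can actually serve in parallel, rather than at whatever thread count sounds impressive. Here is a minimal sketch of that idea in Java, assuming a hypothetical runQuery() standing in for the real JDBC work and a slot count supplied by the DBAs; it is an illustration of the technique, not Pratik's actual code.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: bound the number of in-flight database queries to the number of
// query slots the database can actually serve in parallel, instead of letting
// 500 worker threads all issue queries at once.
public class BoundedQueryRunner {

    // Assumption: the DBAs say roughly this many parallel query slots exist
    // (10 in the story); it is not a universal constant.
    private static final int DB_PARALLEL_SLOTS = 10;

    private final ExecutorService pool = Executors.newFixedThreadPool(DB_PARALLEL_SLOTS);

    public void process(List<String> messages) {
        for (String message : messages) {
            // Each task runs one (hypothetical) query; at most
            // DB_PARALLEL_SLOTS of them touch the database at any one time.
            // The rest simply wait in the executor's queue, in memory.
            pool.submit(() -> runQuery(message));
        }
    }

    private void runQuery(String message) {
        // Placeholder for the real JDBC call against the 200-million-row tables.
        System.out.println("querying for " + message);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedQueryRunner runner = new BoundedQueryRunner();
        runner.process(List.of("order-1", "order-2", "order-3"));
        runner.shutdown();
    }
}
```

With a fixed pool sized to the database, the backlog queues quietly in the application instead of piling up as hundreds of hung sessions on the server.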
"But we tested this in the test database!"
Pratik replied, "The test DB had 200 records. The pre-production DB has 200 million records. The queries will take much longer and use far more resources, so you can't run anywhere near as many of them in parallel."
Pratik told them to reduce the number of threads to 10 and to restart.
"But that can't work; less threads will do less processing, not more!"
After some loud back and forth, the thread count was lowered, the job was rerun, and it took all of 30 minutes, which was quite reasonable given the much larger amount of data involved.
A few minutes later, Pratik was cc'd on an email instructing all developers to lower the thread count of all threaded applications to 10. He fired back that this was not wise, as ten is not some magic number; it just happened to match the number of database connections available for parallel queries in this case. He received another email blast responding that the development team needed a policy on how to precisely determine the correct number of threads to configure for an application. Fearing an extended email war, Pratik bowed out and went back to sleep.
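Pratik's point about the magic number is worth spelling out: the worker count should be derived from the same configuration that sizes the database connection pool, not decreed by policy email. The following hedged sketch, using hypothetical names (a "db.pool.maxConnections" property and a workersFor() helper), shows one way to keep the two from drifting apart.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: 10 was not a magic constant, it simply matched the number of
// database connections available for parallel queries. Derive the worker
// count from the same setting that sizes the connection pool.
public class WorkerSizing {

    public static ExecutorService workersFor(Properties appConfig) {
        // Assumption: the connection pool's maximum size lives in application
        // config under a key like "db.pool.maxConnections".
        int maxConnections = Integer.parseInt(
                appConfig.getProperty("db.pool.maxConnections", "10"));

        // One worker per connection: more threads than connections just wait
        // for a connection (or, worse, pile up inside the database) while
        // adding scheduling overhead.
        return Executors.newFixedThreadPool(maxConnections);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("db.pool.maxConnections", "10");
        ExecutorService workers = workersFor(config);
        workers.submit(() -> System.out.println("worker pool sized to the connection pool"));
        workers.shutdown();
    }
}
```

If the DBAs later grant more connections for parallel queries, the worker pool grows with them; no blanket "set everything to 10" policy required.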