I've frequently posted about how my attempts to speed up our system have been thwarted by sleepy management decisions about application performance. Our application is essentially: query data, crunch data and save results. Each of those tasks took approximately one third of the run time. A large part of my job is to make the application run more quickly. Every time I fixed something, something else would break, not because I had coded something incorrectly, but because of the fragility of our application, other associated applications and the database. For instance...

Sleep for the database

We run four instances of our application. Each performs about 30 load-data queries, most of them against tables with 3+ billion rows. If all four instances run their queries at the same time and each query runs in its own thread, that's 120 parallel queries from our application server instances alone. The database simply could not keep up. The original developer created five threads to run the thirty queries. Then, to add insult to injury, he added sleeps after each query (in each thread) to give the database a chance to catch up. Part of the problem turned out to be that the DBAs had allocated only 512 MB of temp space for all users. They refused to fix it.
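A small worker pool can act as the throttle on its own, with no sleeps, because the pool size caps how many queries are in flight. The following is a minimal sketch of that idea, not the original code: `BoundedQueryRunner`, `runAll` and `maxConcurrent` are hypothetical names, and each `Callable` stands in for one load-data query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedQueryRunner {
    /**
     * Runs every query, never more than maxConcurrent at a time.
     * Results come back in the same order as the submitted queries.
     */
    static <T> List<T> runAll(List<Callable<T>> queries, int maxConcurrent)
            throws InterruptedException, ExecutionException {
        // The fixed pool size is the throttle: extra queries wait in the pool's queue.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every submitted query has completed.
            for (Future<T> done : pool.invokeAll(queries)) {
                results.add(done.get()); // a failed query surfaces as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With `maxConcurrent` set to 5, thirty queries would run five at a time per instance. Whether five is the right number for a given database is something only measurement can answer.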

Sleep to cool the CPUs

Due to a bunch of data-crunching speedups, the CPUs no longer had a chance to throttle down. The heat built up in the server, a fan controller fried and the CPUs cooked. Rather than purchase redundant cooling, they made me slow the application down to previous performance levels with strategically placed sleeps.

Sleep because the application was lapping itself

As a result of the data-crunching performance enhancement, the application now finished that task in a tiny fraction of the original time. The next phase was to speed up the save operation by replacing thousands and thousands of blocking calls to the db server with a single stored procedure call with one (struct) parameter containing all the data. The db server could then grind through it quickly. This reduced the save-results portion of the application to nearly instantaneous run time. With 2/3 of the application's processing now completing in sub-second intervals, the application was essentially lapping itself and blocking on the queries. I was not allowed to rewrite the query part of the application, so instead of letting me fix the queries, they made me add sleeps after the sped-up save to simulate the original slow performance and give the database a chance to settle.

Sleep because the clocks aren't synchronized

Inexperienced developers sometimes do something like this:

  private int someFunc(SomeArg arg) {
    // some crunching
    // return value
  }

  // ...

  // Note: every branch below calls someFunc(arg) again
  if (someFunc(arg) == 1) {
    // do stuff for: 1
  } else if (someFunc(arg) == 2) {
    // do stuff for: 2
  } else if (someFunc(arg) == 3) {
    // do stuff for: 3
  } // etc

We found a very deeply nested series of repeated function calls that hit the db to get the same value. We replaced them with a single query, a local copy of the result and a local getter. That sped up the application's throughput so much that, combined with unsynchronized clocks, our messages appeared to be received by a downstream system before they were sent (which wrought all sorts of havoc). Rather than fix the synchronization of the clocks, they made us add sleeps to compensate.
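For comparison with the if/else chain above, here is a minimal sketch of the "call once, keep a local copy" shape. The names are hypothetical: `someFunc` here is a stand-in that takes an `int` and counts its own invocations, whereas the real one hit the db.

```java
final class CallOnce {
    // Counts how many times the expensive lookup actually runs.
    static int calls = 0;

    // Hypothetical stand-in for the expensive lookup; returns 1, 2 or 3
    // for non-negative arguments.
    static int someFunc(int arg) {
        calls++;
        return arg % 3 + 1;
    }

    static String handle(int arg) {
        int result = someFunc(arg); // evaluate once, keep a local copy
        switch (result) {
            case 1:  return "one";   // do stuff for: 1
            case 2:  return "two";   // do stuff for: 2
            case 3:  return "three"; // do stuff for: 3
            default: return "other";
        }
    }
}
```

Every branch now tests the same local value, so the lookup runs exactly once per call to `handle`, no matter which branch matches.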

Sleep because there's no other way of knowing when an asynchronous event occurs

We found one piece of code that was waiting for an asynchronous event to occur before proceeding. Did they use some sort of latch mechanism? Semaphores? Wait/Notify? Messaging? Nope! Just before the code to process the event, they put several sleeps in a row, each with a comment that an asynchronous event still hadn't happened, so additional wait time was added.
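For the record, the latch version is only a few lines. This is a minimal sketch using `java.util.concurrent.CountDownLatch`; the class name `AwaitEvent` and its methods are hypothetical, and the `String` payload is an assumption about what the event carries.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class AwaitEvent {
    private final CountDownLatch arrived = new CountDownLatch(1);
    private volatile String payload;

    // Called by whichever thread delivers the asynchronous event.
    void onEvent(String data) {
        payload = data;      // publish the data first...
        arrived.countDown(); // ...then release anyone waiting
    }

    // Blocks until the event arrives or the timeout expires.
    // Returns the payload, or null if the timeout expired first.
    String awaitEvent(long timeout, TimeUnit unit) throws InterruptedException {
        return arrived.await(timeout, unit) ? payload : null;
    }
}
```

The waiting thread wakes as soon as `onEvent` is called rather than at the end of an arbitrary sleep, and the timeout makes the "it still hasn't happened" case explicit instead of leaving it to a comment.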

Too much sleeping makes the application appear sleepy

Now we are getting complaints from those same managers that our code takes too long to insert data into the database. Erm, we could just reduce and/or remove some of the sleep delays. No, fix the problem! But the problem is in other applications and the lack of geographical data distribution on the database disks; we don't control the other teams or the DBAs. That's why the delays are in the code! Remember?

Words get heated.

Emails fly.

Fingers point.

Result: Buy $3MM worth of faster DB servers and hire more junior (i.e., cheap) developers to work on the software.

Problem: the servers aren't loaded now; it's the fragile timing of the other applications and disk I/O (the disk platters are shared amongst applications and the heads are bouncing all over the place).

We don't need more people; we need better people!