In every global organization, there comes a point where someone figures out that all of those servers scattered throughout the planet aren't running at 100% capacity, and that they are sitting there going:

    Got anything for me to do?
    Got anything for me to do?
    Got anything for me to do?
    ...

It then occurs to these folks that these otherwise wasted CPU cycles can be leveraged to benefit the firm. Most people then realize that what they should do is buy a farm of machines and have some software to manage processes that are to run on the grid.

Most people.

At H. W.'s company, they designed virtual grids of virtual machines. Each one was carved out of some spare RAM and CPU cycles on an underutilized server somewhere in the firm. These could be dynamically allocated on an as-needed basis - within certain constraints. For example, you may only be allowed to get 4GB of RAM in your transient VM.

Once this home-grown grid software was built and turned loose in production, the team had to justify its effort by having some applications actually use the grid. Since all of the development teams had existing production hardware, they were loath to risk moving their otherwise stable production systems onto a new virtual platform. This meant that the grid folks could only target new application development.

To this end, they started from the top and applied political pressure downward to use the grid for anything new that was being built.

H. W.'s team was building a replacement for an aging, difficult-to-maintain legacy system, and was informed that not only would they not be getting new hardware for the replacement, but that they had to use the grid.

About 15 other teams were given the same directive.

Fast forward about a year and all of the new systems were ready to deploy on the grid. Each application comprised numerous up-front preparatory jobs, the main jobs, and the post-run clean-up jobs. The way you'd access the grid was to make a request for a VM with certain specifications (MB of RAM, number of available cores, access to certain file systems). The grid management software would see what was available at that moment. If what you needed was free, it was yours. If it was not available, your request would block until the resources you needed could be provided.
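For readers who want to picture that request model, here is a minimal sketch. It is not the firm's actual grid software: the `BlockingVmPool` class and its method names are hypothetical, and it tracks only a count of free VMs rather than RAM, cores, or file systems. The `timeout` parameter is also an addition for illustration; as described, the real grid simply waited forever, which corresponds to `timeout=None` here.

```python
import threading


class BlockingVmPool:
    """Hypothetical stand-in for the grid: a fixed number of identical VMs."""

    def __init__(self, total_vms):
        self._free = total_vms
        self._cond = threading.Condition()

    def request_vm(self, timeout=None):
        """Block until a VM is free, then take it.

        timeout=None waits indefinitely, like the grid in the story.
        Returns False only if a timeout was given and it expired.
        """
        with self._cond:
            got_one = self._cond.wait_for(lambda: self._free > 0, timeout=timeout)
            if got_one:
                self._free -= 1
            return got_one

    def release_vm(self):
        """Return a VM to the pool and wake one waiting requester."""
        with self._cond:
            self._free += 1
            self._cond.notify()
```

Note that nothing in `request_vm` knows or cares what the caller is already holding. That detail matters later.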

The justification for this was that these were production jobs that absolutely, positively had to run to completion, so failing to allocate a required resource was simply not an option. If your job needed a machine, you'd get it as soon as it was available. Period!

The various applications were brought on-line, one at a time, independently. Although they fired up their series of jobs at different start times, there was significant overlap.

After a few weeks, several applications had been brought on-line. Then it started. Jobs would periodically, and seemingly randomly, take an inordinately long time to finish. There were no errors in the application logs, just incredibly long pauses between log statements. No amount of debugging in any of the applications could find anything wrong.

While this was going on, more applications were brought on-line. The pauses got longer, and jobs were not finishing within their allotted windows. Naturally, this triggered all sorts of redness on various dashboards, and managers started inquiring as to why these brand new applications were failing to complete on time. Again, no amount of debugging in any of the applications could explain the reason for the pauses where all processing within an application appeared to just stop dead.

Finally, it was H. W.'s turn to go live. This application read data from 33 different source systems, and allocated a lot of caches. Since these caches were all larger than the available VM RAM limits, they had to be broken down into sub-range caches (e.g., A-E, F-J, ...). This forced the allocation of a lot of VMs. During scale testing, the application consistently finished its work in 30 minutes. When run on the grid, it took upward of 4 hours.

Then it happened. By random chance, all the applications on the grid stopped dead at the same time.

At this point, the source of the problem was not to be found in application logs, but in the logs of the grid VM server itself. It turns out that instead of requesting all of the resources that it would need up front, each application would grab VMs as needed. Of course, if application 1 grabbed 10% of the VMs, and application 2 grabbed 10% of the VMs, and ... application 10 grabbed 10% of the VMs, and then each of them needed to grab one more, all of them were blocked while waiting for one of the others to free up a VM. In perpetuity.
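The ten-applications scenario is a textbook hold-and-wait deadlock, and it can be checked with a few lines of arithmetic. The sketch below is illustrative only: `find_stuck_apps` is a hypothetical helper, the grid size of 100 VMs is an assumed number, and it assumes an application releases its VMs only after it has acquired everything it needs.

```python
def find_stuck_apps(total_vms, held, remaining):
    """Return the apps that can never finish.

    held[app]      -- VMs the app currently holds
    remaining[app] -- additional VMs the app still needs before it can finish
    Assumes an app gives back all of its VMs only once it has finished.
    """
    free = total_vms - sum(held.values())
    unfinished = set(held)
    progressed = True
    while progressed:
        progressed = False
        for app in sorted(unfinished):
            if remaining[app] <= free:
                # The app gets what it needs, finishes, and returns everything,
                # so the net effect is that the VMs it was holding become free.
                free += held[app]
                unfinished.remove(app)
                progressed = True
                break
    return sorted(unfinished)


# Ten apps each hold 10% of an assumed 100-VM grid, and each wants one more.
apps = [f"app{i}" for i in range(1, 11)]
held = {app: 10 for app in apps}
remaining = {app: 1 for app in apps}

stuck = find_stuck_apps(100, held, remaining)
not_stuck = find_stuck_apps(101, held, remaining)
```

With exactly 100 VMs, `stuck` contains all ten applications. With a single spare VM (101), one application can finish, its VMs free up the next, and `not_stuck` comes back empty.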

Hilarity ensued when each of the development managers demanded that their application run to completion before the next application was allowed to start. Of course, there was no hardware for them to do an emergency migration as the hardware from the legacy systems had been re-purposed after the new applications were deployed.

The grid folks hacked together a change that allowed an application to specify a list of resources it would need up front and insisted that everyone make an emergency change to utilize it. Of course, an application that used a huge amount of resources would block anything else from running. You could also start two small applications leaving most of the grid free, but if there wasn't enough left for a big application, it wouldn't start, even though lots of resources were still available.
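Both side effects of the all-up-front fix fall out of a very small model. This is a sketch under assumptions, not the grid team's actual change: `UpFrontGrid` and its methods are hypothetical names, the VM counts are made up, and `try_start` returns `False` where the real grid would presumably have made the application wait.

```python
class UpFrontGrid:
    """Hypothetical all-or-nothing reservation: an app starts only if its
    entire declared requirement is free at that moment."""

    def __init__(self, total_vms):
        self.total_vms = total_vms
        self.reserved = {}  # app name -> VMs reserved for its whole run

    @property
    def free(self):
        return self.total_vms - sum(self.reserved.values())

    def try_start(self, app, vms_needed):
        if vms_needed > self.free:
            return False  # the real grid would block here instead
        self.reserved[app] = vms_needed
        return True

    def finish(self, app):
        del self.reserved[app]


grid = UpFrontGrid(100)  # assumed grid size
started_small_a = grid.try_start("small_a", 10)
started_small_b = grid.try_start("small_b", 10)
started_big = grid.try_start("big", 90)  # refused: only 80 of 100 VMs are free
```

The deadlock is gone, because nobody holds some VMs while waiting for more. The price is that `big` sits idle with 80 VMs free, and once it does start, almost nothing else can.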

There are now 15 massive efforts to try and figure out a way to get each of these applications off the grid.

Photo credit: Foter / CC BY-SA