Reliability is its own, very important art. Unless you're, say, Google, you shouldn't write your own reliability systems, but instead buy solutions from a vendor. Just not this vendor. Original. --Remy
It was a little past 4AM when Massimo's support pager went off, jarring him awake. Without even looking at the pager or logging into his laptop, he flipped on the television to Channel 242: the Video on Demand channel for the Italian TV broadcaster that he worked for.
Nothing.
Grumbling as he got dressed, Massimo knew that, again, now for the third time this month, he was going to have to go into work freakishly early to fix the VoD service.
The server software that was behind the station's Video on Demand service was a curious creature. One of the features that the software package boasted was the Redundancy Manager, a system that could "cluster" the software responsible for managing the conditional access system to the Video On Demand programs. Essentially, if one application server went down, then a back-up server running in stand-by mode would kick in.
Rather than take advantage of fancy-schmancy Windows functionality, the developers of the Redundancy Manager chose to take their own, custom approach.
Under "normal" circumstances where one server would go down, the other one in the station's cluster of ...um...two servers... would pick up where the other left off. More often than not, however, a situation would arise where one server would lose its connection to its other half of the cluster for only a minute. Of course, to the Redundancy Manager, its mate was dead as a doorknob.
With neither systems being aware of the other, they would start fighting for hardware resources which were meant to be accessed by only one server. They would collide, give up due to various errors, then they would try and fail again, stuck in an infinite loop because they had no way of handling such situations. The main server that ran the Redundancy Manger would sit idly by since the mighty "clustering" application only started and stopped the servers on demand.
Time after time, Massimo would be called on the carpet by management because of his alleged inability to administer the servers. He tried to explain the design flaws were not because of him, but the software they purchased that ran on the ugly cluster of servers. Also, he had to explain slowly and carefully every time that even though they had spent the equivalent of more than $100,000 USD for the license, the vendor wasn't going to re-design their software at a customer's request.
Right around 4:45am, Massimo pulled into the TV station's parking lot and was in the server room by 5am. Sleepily, he walked up to the network hub and unplugged the network cable on one server to stop the bleeding. Next, Massimo logged into the master server, one by one,brought down all the applications on the connected server and brought them back up again. After rebooting the final part of the cluster, Massimo clicked "OK" to restore its connection with the Redundancy Manager and wondered what time IT recruiters got into their offices.