Pritesh C. took a position working for an Architect's Architect, the sort of designer determined to out-do and out-perform every other system. Everything this guy ever designed and coded had more features, more capabilities, and more brute-force performance than any similar widget ever designed by anyone else, anywhere, ever.

Way back when the system was first being designed, the Architect's Architect decreed that communication with the system was to be via JMS messaging. Additionally, there would be two queues: one for normal messages and one for high-priority messages. Applications requiring the services of this new application would send normal-priority messages for batch processing and high-priority messages for direct user interaction. The application would monitor both queues, but let the high-priority messages jump to the head of the line for faster response. Also, because the system it was to replace was notoriously buggy and unreliable, this system had to be bulletproof. It simply could not go down.

When Pritesh was being trained on the system, he asked the Architect's Architect how the system would respond when this or that error occurred.

The Architect's Architect brazenly claimed that absolutely nothing could cause this system to go down. No error could make it fail. No condition could make it lose control. It was unstoppable. It would be reliable until the end of time. "Trust me!"

Pritesh had no doubt that the Architect's Architect believed that. Of course, things don't always work out as planned.

One day, Pritesh received a call from the folks who monitored the production systems: "Your logs are growing way out of control. Your application has generated a month's worth of logs in the last 24 hours, and there has been no appreciable increase in load; please check it."

Pritesh logged onto the production server and found a long series of sequentially numbered log files, each approaching 20GB, all from that day! There was no way to open any of the files in an editor. All he could do was grep for Exception. Exceptions spewed forth. It looked as though the application was spitting out exceptions and stack dumps hundreds of times each second. Tailing the log showed that every exception was exactly the same: "Unable to access queue: queue manager appears to have died."

Unfortunately, the queue manager disagreed; it was quite alive and well and processing messages.

In the meantime, customers were escalating performance problems. Managers were coming over to inquire. Ears were smoking. Eyeballs were steeling. Pritesh could feel the death-stare upon him.

Further examination showed that while some of the instances of the application were spinning and generating mountains of exception stack dumps, others were merrily consuming messages and crunching along.

After some digging, Pritesh discovered that the high level code to process messages looked like this:

    Connection          conn;                    // injected via Spring
    Queue               receiveQueueNormal;      // injected via Spring
    Queue               receiveQueuePriority;    // injected via Spring

    Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
    MessageConsumer consumerNormal   = session.createConsumer(receiveQueueNormal);
    MessageConsumer consumerPriority = session.createConsumer(receiveQueuePriority);

    while (true) {    // AA: This will never fail: trust me!
       try {
             Message msg = consumerPriority.receiveNoWait();
             if (msg == null) {
                 msg = consumerNormal.receiveNoWait();
             }
             if (msg != null) {
                 // process the message here
             }
             session.commit();
       } catch (Throwable t) {
             log.error("Error processing message", t);  // dump the stack trace and spin right back around
       }
    }

Further digging in the queue configuration file revealed the problem:

    <policyEntry topic=">" producerFlowControl="true" memoryLimit="1mb">

When the message volume was sufficiently small and there were no messages to process, the code would scream along, checking the queues countless times each second for something to process.
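The culprit is `receiveNoWait()`, which returns immediately (null if the queue is empty), so an idle consumer loops at CPU speed. A blocking call with a timeout parks the thread instead. JMS aside, the same contrast can be sketched with a plain `java.util.concurrent.BlockingQueue`; this is a standalone illustration, not the application's actual code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        // Busy-wait style, analogous to receiveNoWait(): returns null
        // immediately on an empty queue, so a bare loop around this call
        // spins at CPU speed.
        String msg = queue.poll();
        System.out.println("poll(): " + msg);  // null

        // Blocking style, analogous to receive(timeout): parks the thread
        // for up to one second, consuming no CPU while it waits.
        msg = queue.poll(1, TimeUnit.SECONDS);
        System.out.println("poll(1s): " + msg);
    }
}
```

A loop built on the blocking form checks the priority queue first with a short timeout, then the normal queue, and simply sleeps inside the driver when both are empty.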

But when the business grew and the message volume increased just enough, the queue manager ran out of memory and dropped some, but not all, of its connections (presumably in self-preservation mode). The applications, designed and written on the assumption that nothing bad could ever happen, were still spinning at CPU speed, but now they were generating log messages and stack dumps every time through the loop because they couldn't communicate with the queue manager.
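For what it's worth, a roomier destination limit in that style of configuration would look like the line below (the 64mb figure is an arbitrary example, not a value from the story). Raising the limit only postpones the symptom, though; the real defect was the error handling.

```xml
<policyEntry topic=">" producerFlowControl="true" memoryLimit="64mb">
```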

There was no reconnect code. No progressive delays between dumping identical log errors. No email to support-lists saying "Bad thing happened!"
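What that catch block needed was reconnection with progressive delays: wait longer after each consecutive failure, up to a cap, and reset once a retry succeeds. A minimal sketch of such a delay schedule, using hypothetical class and method names rather than anything from the original system:

```java
/** Exponential backoff schedule: base, 2*base, 4*base, ... capped at max. */
public class Backoff {
    private final long baseMillis;
    private final long maxMillis;
    private int failures = 0;

    public Backoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Called after each failed attempt; returns how long to sleep before retrying. */
    public long nextDelayMillis() {
        long delay = baseMillis << Math.min(failures, 20);  // clamp shift to avoid overflow
        failures++;
        return Math.min(delay, maxMillis);
    }

    /** Called after a successful reconnect. */
    public void reset() {
        failures = 0;
    }
}
```

The catch block would then sleep for `nextDelayMillis()` before rebuilding the connection and session, call `reset()` once a reconnect succeeds, and log (or mail the support list) only on the first failure and when the delay reaches its cap, rather than hundreds of identical stack dumps per second.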

Just a lone comment in the source code from the Architect's Architect saying: "Trust me!"
