Broken Communication

Rrrrrriiiinnngggg. Ahh, yes. The 2:00AM support call. There’s nothing else quite – rrrrrriiiinnngggg – like it to remind us that no place, not even Happy Dreamland, is – rrrrrriiiinnngggg – a sanctuary from work. “Hun,” Michael’s wife grumbled, “Aren’t you going to get – rrrrrriiiinnngggg – that?!” Michael rolled out of bed and answered the support phone. There was apparently a pretty serious problem with one of the dedicated communication servers.

“Did you check the operations logs?” Michael asked. Surprisingly, the operator had already done that. And attempted to resynch the connection. And cleared the message queue. And everything else on Michael’s step-by-step failure procedure, up to and including calling Michael when all else failed. As pleased as Michael was to find a competent operator working that shift, he still had to go into the office and fix whatever was wrong. The stock exchange was opening in seven hours, and there’d be a lot of unhappy investors if his company couldn’t relay trades.

Michael hastily got dressed, hopped in his car, and began his hour-long commute into the city. On the way in, he dialed up his team’s trusty hardware engineer and told him to meet at the office. The software developers were spared a trip downtown, but not a middle-of-the-night call: they were advised to be prepared to provide support. With forty minutes of driving left, Michael joined the conference call between his company’s operations group and the exchange’s operations group.

“But we’ve run all of the diagnostic programs,” someone chimed in, “we’re simply not seeing any traffic – even validation headers – come in from your end.”

“That’s impossible,” another shouted, “because we’re definitely sending them. Didn’t you see the log files I emailed?”

Michael really had no idea who was who or even what side they were on, but it was pretty clear that no one wanted to take the blame for the problem. The back-and-forth continued for another fifteen minutes until one of them had a Eureka Moment. “Wait a second,” an operator interrupted, “did anyone actually check if the dedicated line was experiencing problems?”

None of them had, and it certainly was a remote chance that a single line out of dozens would fail, but it presented a great diversion for everyone involved. They could add a third party to the conference call to volley the blame. Namely, the phone company.

By the time each operations group got in touch with their respective phone company representative, Michael had finally arrived at the office. And just in time, too: upper management had been called and advised to prepare for a day without trades.

Michael rushed inside and met with the hardware engineer to figure out what was going on. They put the software in “virtual mode” and started to debug the communications.

Within minutes, they figured out where the problem was. It was their dedicated communications server. For whatever reason, it was not relaying messages to the exchange’s server. As they were unable to remotely access the server (by design, for security purposes), Michael and the hardware engineer took a trip to the datacenter to check things out.

The operations groups were still going back and forth with each other, so Michael and the hardware engineer walked over to the communication server’s cabinet and opened the door. It was empty. There was nothing more than an empty shelf and a few cables. Michael immediately thought that the server had been stolen. His mind raced through everything that’d be required to set up one of these new, specialized devices.

“Errrr,” the hardware engineer quietly uttered. Michael turned around to find a confused, ghastly glare. “I, errr,” the engineer stumbled, taking a moment to collect himself, “I thought this is where we keep the test server. The production server… that’s on … on rack 18-B?”

It wasn’t. In fact, rack 18-B was where they kept the test server. The production server, as it turned out, was sitting at the hardware engineer’s desk, about to be configured. Michael and the hardware engineer rushed back to his office and discreetly carried the machine back into the server room. Fortunately, the operations group was far too busy arguing with the exchange’s operations group and the phone company to notice them hooking the server back up.

“Great news, guys,” Michael said, barging into the meeting room, “we’ve got it fixed! We’re up and running again.”

The operations group lead was stunned. “No kidding! What’d you do?”

“Dunno,” Michael shrugged, “we just rebooted it.”
