Locator LED

Our anonymous submitter—we'll call him Russell—was a senior engineer supporting an equally anonymous web service that was used by his company's desktop software for returning required data. Russell had a habit of monitoring the service's performance each day, always on the lookout for trouble. One fateful morning, the anomalies piled on thick.

Over the past 24 hours, the host server's average response time had halved, and yet the service was also suddenly dealing with four times as many requests as usual. Average CPU and memory usage on the server had doubled, as had the load on the Oracle host. Even stranger, there was no increase in server errors.

Russell couldn't imagine what might've happened, as no changes had been deployed. However, his product team had recently committed to reducing average server response time. It was possible that someone else had modified an upstream service or some database queries. He emailed the rest of the team and other teams he worked closely with, detailing what he'd seen and asking whether anyone had any pertinent information.

The response from the engineers was basically, "Hmm, odd. No, we didn't change anything." The response from the product architects really shouldn't have surprised Russell, given he'd been working in enterprise for nearly 20 years. The reply-all frenzy can be summed up as, "You mean we've already fulfilled our commitment to reduce average response time?! LET'S FIRE OFF A SELF-CONGRATULATORY COMPANY-WIDE EMAIL!!!"

Upon seeing this, Russell immediately replied: "Hold on, let's try to find out what's happening here first."

Unfortunately, he was too late to stop the announcement, but that didn't stop him from investigating further. He remembered that their default monitoring of server errors filtered out 404s. Upon turning off that filter, he found that the number of 404s thrown by the server roughly matched the number of additional requests. Previously, average response time had been around 100ms; at present, it was about 45ms. This "triumph" hid the fact that the numerous 404s were processed in about 10ms each, while the non-404 requests were processed in about 150ms each—50% slower than usual. In other words, the web service's performance had been seriously degraded.
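For anyone who wants to check the math, here is a minimal sketch of how the blended average comes out. It assumes the extra traffic was made up entirely of 404s and amounted to exactly three times the usual volume (the story only says "roughly"); the baseline request count and the helper name are hypothetical and chosen purely for illustration.

```python
def weighted_average_ms(groups):
    """Average latency across (request_count, latency_ms) groups."""
    total_requests = sum(count for count, _ in groups)
    total_time_ms = sum(count * latency for count, latency in groups)
    return total_time_ms / total_requests


# Hypothetical baseline volume; only the ratios matter for the result.
baseline_requests = 1000

# Before: normal traffic only, at about 100 ms per request.
before = weighted_average_ms([(baseline_requests, 100)])

# After: four times the traffic. Assumption: the extra 3x are all 404s
# at about 10 ms each, while real requests now take about 150 ms.
after_blended = weighted_average_ms([
    (3 * baseline_requests, 10),   # 404 responses
    (baseline_requests, 150),      # non-404 responses
])

# What the dashboard should have shown: non-404 requests only.
after_real = weighted_average_ms([(baseline_requests, 150)])

print(before, after_blended, after_real)
```

Under those assumptions the blended figure lands on the 45ms that got celebrated, while the requests that actually returned data sit at 150ms.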

Russell dug further to figure out who was performing this low-key DDoS attack. The requests were authenticated, so he knew the calls were coming from inside the house. He managed to trace them to another product within his company. This product needed data from his web service in only about 1% of its sessions, but waiting on that request considerably slowed down the handling of those particular sessions. As a result, someone had modified the product to fire off an asynchronous request to Russell's service at the start of every session, simply ignoring the response if it was a 404.

Russell emailed his findings to his team, but received no reply. Feeling bold, he directly contacted the project manager of the offending product. This led to the biggest WTF of all: the PM apologized and got the change rolled back right away. By the next day, everything was back to normal—but the product architects were angry over the embarrassment caused by their own premature celebration. They were likely also miffed about being forced to find real ways of improving average server response time. Their misplaced ire led to Russell being fired a short time later.

However, our story has a happy ending. The super-responsive product team hired Russell back on after a couple of months, with a 25% pay raise. He retained seniority, and was allowed to keep his former benefits as well as his severance package. In the end, the forces that'd sought to be rid of him had only succeeded in giving him a highly paid vacation.
