[Photo: TI TMX390Z55GF SuperSPARC TMS390 cache controller]

Recently, we featured the story of Alex, who worked in a little beach town trying to get seasonal work. But Alex isn't the only one with a job that depended entirely on the time of year.

For most seasonal work in IT, it's the server load that varies. Poor developers can get away with inefficient processes for three quarters of the year, only to have it bite them with a vengeance once the right season rolls around. Patrick, a Ruby developer, joined an educational technology company at the height of revision season. Their product, which consisted of two C#/Xamarin cross-platform mobile apps and one Ruby/Rails back-end server, was handling its heaviest traffic of the year. On his first day at the office, the entire tech team was called into a meeting with the CEO, Gregory, to address the slowness.

Last year, the dev team had been at a similar meeting, facing similar slowness. Their verdict: there was nothing for it but to rewrite the app. The company had, surprisingly, gone in for it, giving them 6 months with no task but to refactor the app so they'd never face this kind of slowdown again. Now that the busy season had returned, Gregory was furious, and rightly so. The app was no faster than it had been last year.

"I don't want to yell at anyone," boomed Gregory, "but we spent 6 months rewriting, not adding any new features—and now, if anything, the app is slower than it was before! I'm not going to tell you how to do your jobs, because I don't know. But I need you to figure out how to get things faster, and I need you to figure it out in the next 2 weeks."

After he left, the devs sat around brainstorming the source of the problem.

"It's Xamarin," said Diego, the junior iOS Dev. "It's hopelessly unperformant. We need to rewrite the apps in Swift."

"And lose our Android customer base?" responded Juan, the senior Mobile Dev. "The problem isn't Xamarin, it's the architecture of the local database leading to locking problems. All we have to do is rewrite that from scratch. It'll only take a month or so."

"But exam season will be over in a month. We only have two weeks!" cried Rick, the increasingly fraught tech lead.

Patrick piped up, hoping against hope that he could cut through the tangled knot of bull and blame. "Could it be a problem with the back end?"

"Nah, the back end's solid," came the unanimous reply.

When they were kicked out of the meeting room, lacking a plan of action and more panicked than ever, Patrick sidled up to Rick. "What would you like me to work on? I'm a back end dev, but it sounds like it's the front end that needs all the work."

"Just spend a couple of weeks getting to grips with the codebase," Rick replied. "Once exam season is over we'll be doing some big rewrites, so the more you know the code the better."

So Patrick went back to his desk, put his head down, and started combing through the code.

This is a waste of time, he told himself. They said it was solid. Well, maybe I'll find something, like some inefficient sort.

At first, he was irritated by the lack of consistent indentation. It was an unholy mess, mixing tabs, two spaces, and four spaces liberally. This seriously needs a linter, he thought to himself.

He tried to focus on the functionality, but even that was suspect. Whoever had written the backend clearly hadn't known much about the Rails framework. They'd built in lots of their own "smart" solutions for problems that Rails already solved. There was a test suite, but it had patchy coverage at best. With no CI in place, lots of the tests were failing, and had clearly been failing for over a year.

At least I found something to do, Patrick told himself, rolling up his sleeves.

While the mobile devs worked on rebuilding the apps, Patrick started fixing the tests. They were already using GitHub, so it was easy to hook up Travis CI so that code couldn't be merged until the tests passed. He added RuboCop to detect and correct style inconsistencies, and set about tidying the codebase. He found that the tests took a surprisingly long time to run, but he didn't think much of it until Rick called him over.
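
A setup like that doesn't take much. As a rough sketch (assuming the suite was RSpec and that RuboCop sat in the Gemfile; neither is stated in the story), a single default Rake task gives Travis CI something to gate merges on:

# Rakefile (sketch only: rspec and rubocop in the Gemfile are assumptions)
require "rspec/core/rake_task"
require "rubocop/rake_task"

RSpec::Core::RakeTask.new(:spec)
RuboCop::RakeTask.new(:rubocop)

# Running "rake" with no arguments now lints first, then runs the suite,
# which is exactly the check CI can refuse to merge on.
task default: [:rubocop, :spec]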

"Do you know anything about Elastic Beanstalk auto-scaling? Every time we make a deployment to production, it goes a bit haywire. I've been looking at the instance health, and they're all pushing 100% CPU. I think something's failing out, but I'm not sure what."

"That's odd," Patrick said. "How many instances are there in production?"

"About 15."

Very odd. 15 beefy VMs, all running at > 90% CPU? On closer inspection, they were all working furiously, even during the middle of the night when no one was using the app.

After half a day of doing nothing but tracing the flow, Patrick found an undocumented admin webpage tacked onto the API that provided a ton of statistics about something called Delayed Job. Further research revealed it to be a daemon-based async job runner that had a couple of instances running on every web server VM. The stats page showed how many jobs there were in the backlog—in this case, about half a million of them, and increasing by the second.
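
For context, Delayed Job is a real and widely used gem: a job is a plain Ruby object with a perform method, serialised into a database table and picked up by worker daemons. A minimal sketch of the pattern, with invented names (DenormalizeStats and the student id are illustrative, not from the actual codebase):

# Hypothetical job class; delayed_job only requires a #perform method.
class DenormalizeStats < Struct.new(:student_id)
  def perform
    # recompute the denormalised figures for one student
  end
end

# Enqueueing writes a row to the delayed_jobs table for a worker to claim.
Delayed::Job.enqueue(DenormalizeStats.new(42))

# A stats page like the one Patrick found can count the backlog directly.
Delayed::Job.where(failed_at: nil).count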

How can that work? thought Patrick. At peak times, the only thing this does is create a few jobs per second to denormalise data. Those should take a fraction of a second to run. There's no way the queue should ever grow this big!

He reported back to Rick, frowning. "I think I've found the source of the CPU issue," he said, pointing at the Delayed Job queue. "All server resources are being chewed up by this massive queue. Are you sure this has nothing to do with the apps being slow? If it weren't for these background jobs, the server would be much more performant."

"No way," replied Rick. "That might be a contributing factor, but the problem is definitely with the apps. We're nearly finished rewriting the local database layer, you'll see real speedups then. See if you can find out why these jobs are running so slowly in the meantime, though. It's not like it'll hurt."

Skeptical, Patrick returned to his desk and went hunting for the cause of the problem. It didn't take long. Near the top of most of the models sat a line like this: include CachedModel. This is Ruby's module mixin syntax; the CachedModel mixin was pulled into just about every model, forming the backbone of the data layer. It looked like this:


module CachedModel
  extend ActiveSupport::Concern

  included do
    after_save :delete_cache
    after_destroy :delete_cache
  end

  # snip

  def delete_cache
    Rails.cache.delete_matched("#{self.class}/#{cache_id}/*")
    Rails.cache.delete_matched("#{self.class}/index/*")
    # snip
  end
end

Every time a model was saved or destroyed, the delete_cache method was called. It performed a wildcard search across every key in the cache (ElastiCache in staging and production, flat files in dev and test), deleting every key that matched. And of course, a model was saved on every INSERT or UPDATE and destroyed on every DELETE. That added up to a lot of delete_cache calls.
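
The expensive half of that is delete_matched. The stores that support it have no index to lean on; they walk the entire keyspace looking for matches, whereas a plain delete removes one known key. For contrast (the "Student" keys here are invented):

# Wildcard invalidation: the store scans every key it holds.
Rails.cache.delete_matched("Student/42/*")

# Targeted invalidation: one lookup, one removal.
Rails.cache.delete("Student/42/stats")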

As an experiment, Patrick cleared out the delete_cache method and ran the test suite. He did a double-take. Did I screw it up? he wondered, and ran the tests again. The result stood: what had once taken 2 minutes on the CI server now completed in 11 seconds.

Why the hell were they using such a monumentally non-performant cache clearing method?! he wondered. Morbidly curious, he looked for where the cache was written to and read using this pattern of key strings and found ... that it wasn't. The caching mechanism had been changed 6 months previously, during the big rewrite. This post-save callback trawled painfully slowly through every key in the cache and never found anything.
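
The story doesn't say what the rewrite switched to, but the standard Rails alternative makes callbacks like delete_cache unnecessary altogether: fold the record's updated_at timestamp into the cache key, so every save produces a new key, stale entries are simply never read again, and old ones age out via a TTL. A sketch with invented names:

# Read-through caching with a timestamped key; a save changes the key,
# so there is nothing to wildcard-delete afterwards.
def cached_stats(student)
  key = ["Student", student.id, student.updated_at.to_i, "stats"]
  Rails.cache.fetch(key, expires_in: 12.hours) do
    expensive_stats(student)   # hypothetical slow query
  end
end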

Patrick quietly added a pull request to delete the CachedModel module and every reference to it. Once the change was deployed to production, the 15 servers breezed through the backlog of jobs over the weekend, then auto-scaled down to a mere 3 instances: 2 comfortably handling the traffic, plus a third to avoid lag when scaling up. With all those resources freed, the server endpoints became significantly more responsive, and the apps felt noticeably faster. Or at least, they did to Patrick. The rest of the tech team were too busy trying to work out why their ground-up rewrite of the app's database layer was benchmarking slower than the original. Before they figured it out, exam season was over for another year, and performance stopped being a priority.
