W.T.F. Community College hired a team of highly recommended web design consultants to bring its website into the 21st century. Paul was tasked with overseeing their work and supporting the new site upon go-live. After a couple of months of grinding, they cranked out a beautiful new site that was accessible, navigable, and responsive. It also removed the old site's dependence on Flash, replacing it with a titanic mound of PHP and JavaScript served by Apache.

Paul and his team gave it a thorough beating in their test environment and everything seemed solid. They even gave a demo of it to the head of W.T.F. Community College, President Skroob. He was able to easily find his way around, sign up, set his password to 1-2-3-4-5, and register for fake classes. It got the Prez's official salute of approval. The amazing new website was ready to be launched in time for the fall semester.

The first day of classes rolled around and suddenly everything didn't seem so rosy. President Skroob frantically burst into the IT office shouting "The web site is down! The web site is down!" A quick check of the monitoring system didn't show anything wrong, and manual inspection didn't either. The server hosting the site was hardly breaking a sweat: no pressure on CPU, RAM, or I/O. Paul tried browsing to the site and the browser just hung. No errors, no nothing.

"Well that's interesting," Paul said to no one in particular. "We've lost the bleeps, the sweeps, and the creeps."

"Why don't we try restarting Apache?" one of Paul's cohorts suggested. Paul shrugged and did so, and the website immediately returned to normal.

"I don't know what old Indian tribes have to do with our web site, but I'm glad it's working now!" President Skroob stated before ducking out of the office.

Paul set up a simple monitor to hit the home page every 5 minutes to make sure the problem didn't return. But an hour later, the alarm went off. He restarted Apache quickly, and things cleared up. So began a nightmare where someone from Paul's team always had to be available to restart Apache once the website hung. The Prez was not pleased.

"Get a hold of the Apache Chief who made this website and get them in here to fix it NOW!" Skroob shouted at Paul.

Paul got on the horn to the consultants, who hemmed and hawed about checking their code. If they were to be brought back in, it would cost W.T.F. CC an exorbitant sum. Paul decided to look more closely at the server first, and found Apache was running out of processes. He raised the limit on worker processes (`MaxClients`, or `MaxRequestWorkers` in Apache 2.4 and later), which bought them some time before the site would hang again. Every connection held by those processes was stuck in the CLOSE_WAIT state, meaning the client browser had closed its end of the connection but the server hadn't closed its own.

A system call trace on a stuck server process showed it was waiting on a file lock, and a stack dump showed the culprit was `mod_php`. A process blocked on a lock is suspended, so it could neither finish its request nor exit. In fact, all the dozens of stuck processes were waiting on the same file: `/var/lib/php/session/sess_111`.

Paul got back on the phone with the consultant team and learned the story behind this big WTF. The new website had feeds (such as current events and the faculty directory) that could be updated dynamically and might be accessed from any session. But to avoid regenerating the output HTML on each load, it was cached so every PHP process could get to it. The consultants had done this using a single, global, monster PHP session.

Thus, every website hit required a server process to read and write that one session, which meant locking the one session file. Requests could only pass through that lock one at a time. If new requests came in at Ludicrous Speed, they piled up behind the bottleneck, each holding an Apache process, until Apache couldn't spawn any more. At that point, the site would go catatonic to the user.
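To see why one shared session file turns into a single-lane road, here is a minimal Python sketch of the same pattern. It is not the consultants' code. It relies on the fact that PHP's default file-based session handler holds an exclusive lock on the session file while a request has the session open, and imitates that with `fcntl.flock`. The function name `run_requests`, its parameters, and the `sess_N` file names are all made up for illustration, and the sketch assumes a Unix-like system.

```python
import fcntl
import os
import tempfile
import threading
import time


def run_requests(num_requests, num_session_files, work_seconds=0.01):
    """Simulate requests that each lock one of `num_session_files` files.

    Returns the peak number of requests doing work at the same moment.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"sess_{i}") for i in range(num_session_files)]
        for path in paths:
            open(path, "w").close()  # create empty stand-in session files

        counter_lock = threading.Lock()
        state = {"active": 0, "peak": 0}

        def handle(request_id):
            # Each request picks a session file; with one file, all collide.
            path = paths[request_id % num_session_files]
            with open(path, "r+") as fh:
                # Exclusive lock: blocks until no other request holds it.
                fcntl.flock(fh, fcntl.LOCK_EX)
                try:
                    with counter_lock:
                        state["active"] += 1
                        state["peak"] = max(state["peak"], state["active"])
                    time.sleep(work_seconds)  # stand-in for building the page
                    with counter_lock:
                        state["active"] -= 1
                finally:
                    fcntl.flock(fh, fcntl.LOCK_UN)

        threads = [
            threading.Thread(target=handle, args=(i,)) for i in range(num_requests)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return state["peak"]
```

With `num_session_files=1`, the peak never rises above one no matter how many requests arrive, which is the bottleneck in miniature. Spreading requests over four files raises the ceiling to four at best; it does not remove it.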

"Guys, this is in no way acceptable," Paul berated Apache Chief & Co. "You need to find another approach to caching the data." Meanwhile, Paul wrote a workaround script that ran once a minute and restarted Apache if there were too many stuck processes. Two weeks and thousands of dollars later, the consultants ameliorated the problem by dividing traffic amongst four different cache "sessions".
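The story doesn't say what Paul's script actually checked, so the following is only a guess at its shape: count connections on the web port sitting in CLOSE_WAIT and restart Apache past a threshold. It assumes the Linux `ss -tan` column layout (state first, local address fourth), a threshold of 50, and `apachectl restart` as the restart command; all three are assumptions, as are the function names.

```python
import subprocess


def count_close_wait(socket_table: str, port: int = 80) -> int:
    """Count CLOSE-WAIT sockets whose local port is `port`.

    Expects text shaped like `ss -tan` output:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    count = 0
    for line in socket_table.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # ss spells the state with a hyphen; netstat uses an underscore.
        if fields[0] not in ("CLOSE-WAIT", "CLOSE_WAIT"):
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            count += 1
    return count


def check_and_restart(threshold=50, restart_cmd=("apachectl", "restart")):
    """Restart Apache if too many connections are stuck. Returns True if it did.

    Both the threshold and the restart command are assumptions; adjust them
    for the actual server.
    """
    table = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    ).stdout
    if count_close_wait(table) > threshold:
        subprocess.run(list(restart_cmd), check=True)
        return True
    return False
```

Run from cron once a minute, a call to `check_and_restart()` would approximate the babysitter described above. Keeping the parsing in its own function makes it possible to test against canned output without touching a live server.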

It worked well enough that Paul's script only got triggered during extremely high usage times. Even then, there was very little downtime. President Skroob put the kibosh on any more of the school's Space Bucks going to the Apache tribe, so getting them to develop something that wasn't just a hack job was out of the question. As long as it worked most of the time and kept Skroob out of their office, Paul was comfortable with leaving well enough alone.
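For the curious, one common way to share rendered HTML between processes without any of them waiting on a lock is to write the cache to a temporary file and rename it into place. This is a generic technique, not anything the consultants built or proposed, and the names `write_cache` and `read_cache` are hypothetical. The sketch assumes a POSIX filesystem, where `os.replace` swaps the file atomically so readers see either the old version or the new one, never half of each.

```python
import os
import tempfile


def write_cache(path: str, html: str) -> None:
    """Publish `html` at `path` without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cache-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(html)
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_cache(path: str):
    """Return the cached HTML, or None if nothing has been published yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        return None
```

Readers take no lock at all here, so a slow or stuck request can't hold up anyone else. Only the occasional writer does any extra work.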
