Today's 2013 WTF flashback article is a great one written by Remy Porter back on July 30th and is a perfect example of Hanlon's Razor in action.
Steve set aside his Turkish pizza and borek and answered the phone. He was taking lunch around the corner from the office.
“The server is down!” his boss grumbled into the phone. “Where are you? Can you come back in? This is production! Production is down!”
Steve boxed his lunch up and wiped the grease from his fingers, then slogged back to the office. The server was not, in fact, down, but their core application process was. The management panic was inaccurate, but justified. Steve restarted the service before investigating.
It came up, and it stayed up.
Between piping-hot bites of borek, Steve dug through the system. He looked for logs, core dumps, or any other traces of what might have caused the application to crash. Nothing of the kind existed. Since the process stayed up, Steve wrote the issue off as a combination of cosmic rays and butterfly farts in Taiwan. He logged the issue and finished his lunch.
The next day, nothing interesting happened. Nor the day after that. Weeks passed. Then months.
The process went down during lunch again. This time, Steve was already at his desk, enjoying a much healthier lunch, packed up from home. When his boss came in, panicking about production being down, Steve ignored his sandwich and restarted the service.
It came up, and it stayed up. Once again, there were no traces of any error or crash.
The days, weeks and months shambled along. At seemingly random intervals, Steve or one of his co-workers would get a frantic message from their boss: “The server is down!” Sometimes the process died on a weekend. Sometimes they went months without issue; other times the process auto-destructed twice in the same week. Every time, someone was fixing the server while cramming their lunch into their food-hole.
Steve got to thinking. It happened seemingly randomly, but only ever during lunch. As if it were on a schedule…
Steve checked the crontab file on the production server. Like many production systems, the file was large and stuffed with a huge pile of jobs that were needed to keep everything running like it was supposed to. Steve grepped for jobs scheduled to run sometime during the 12PM hour, and found this one:
22 12 * * * kill 21342
Every day, at 12:22PM, the system killed whichever process had the ID 21342. Most days, that was probably no process. Some days, maybe a user’s shell got the PID. But every once in a while, the great roulette wheel of process scheduling came up “00”, and their main production application got assigned that ID.
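For anyone who hasn't stared at a crontab lately: cron reads the first field as the minute and the second as the hour, so a job that fires at 12:22PM is written with `22 12`, not the other way around. Here's a minimal Python sketch of that field order, assuming plain numeric minute and hour fields (no ranges, lists, or steps). The helper names `cron_daily_time` and `command_of` are made up for illustration and are not part of any cron tooling.

```python
from datetime import time


def cron_daily_time(entry: str) -> time:
    """Return the time of day a simple crontab entry fires.

    Assumes the minute and hour fields are plain integers.
    """
    fields = entry.split()
    # Crontab field order: minute, hour, day-of-month, month, day-of-week.
    minute, hour = int(fields[0]), int(fields[1])
    return time(hour=hour, minute=minute)


def command_of(entry: str) -> str:
    """Return everything after the five schedule fields."""
    return entry.split(maxsplit=5)[5]


entry = "22 12 * * * kill 21342"
print(cron_daily_time(entry))  # 12:22:00
print(command_of(entry))       # kill 21342
```

Note that nothing in the command says which program it's after. It names a bare PID, and PIDs get recycled, which is the whole problem.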