It's Thanksgiving here in the US, so we're taking the day off! So, here, enjoy a classic. Cursed and Re-Cursed was originally published on May 7, 2009.
Graham K. was working for an atmospheric chemistry research group at a university in Wollongong, New South Wales, Australia. They'd been running a field experiment in sugar cane fields on a government research farm that was roughly 1,500 km (~932 mi) away in Mackay.
This wasn't their first choice; they were originally based at another site, but it had been quarantined due to a parasite that was discovered in the area, leaving Graham and crew to find another site for their research. Actually, "finding another site" is a bit of an understatement: they had to scramble to scout a location, carefully negotiate with the landowner, convince the locals that the "government experiment" was only about sugar cane, establish local research offices, move all of the field equipment, and relocate the scientists. As you might imagine, field support was foisted on an already overworked team.
At their second site, they decked out a caravan with computer gear and various instruments and set it in the middle of the cane fields. Power was drawn from a nearby water pump shed, and internet access came from a recently-commissioned HSDPA wireless network.
The whole setup was working like a charm. Graham could remotely log in, monitor the status of the instruments, and view the data that had been collected. Maybe, Graham thought, I wouldn't even have to burden the field scientists any further – they had enough work to do anyway, and this last-minute project certainly wasn't going to be high on their list of priorities. After staying a few more days in Mackay to confirm that the system was stable, Graham made his way back down to Wollongong.
Best Laid Plans
The next few months confirmed it – Graham was working on a cursed project. Things had started well, with a solid two weeks of good data and no complications. That is, until a hard drive died, requiring Graham to spend part of his Christmas break making the long journey to Mackay to replace the faulty hardware.
Things continued fine for a few more weeks, until there was suddenly radio silence. No connection, no data access, not even a ping response. A field support scientist discovered that the caravan wasn't receiving any power because someone had crashed into and knocked over a power line several miles away, shutting everything down.
Some time later, Mackay endured devastating floods, which resulted in weeks of lost work while Graham and crew were left to wonder whether their equipment had been washed out to sea. Because of the flooding, they couldn't even send a car out to check on the gear for over a week.
Worse still, the PCMCIA wireless card had a faulty driver, killing the network connection frequently. And when they couldn't connect, they never knew if it was an instrument failing, a power failure, a connection problem, or another car taking out power lines. Calls to field support became more and more frequent, and the delays in their responses grew longer and longer. More often than not, the scientist would just have to hit the reset button and be on their way. Graham wished he had a more elegant solution for the short term, but ultimately decided to make the whole setup auto-reboot daily. It wasn't a huge improvement, but it at least made things slightly better. In spite of everything, though, they were still losing depressing amounts of data due to slow response times when things failed.
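The diagnostic problem described above can be made concrete with a small sketch. This is not Graham's software; the `Probe` fields, the `diagnose` function, and the ten-minute staleness threshold are all hypothetical names and values chosen for illustration. It assumes the remote side can observe only two things: whether the caravan's computer answers at all, and how old the newest instrument reading is.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    HEALTHY = "healthy"
    INSTRUMENT_FAULT = "instrument fault"
    LINK_OR_POWER_DOWN = "link or power down"


@dataclass
class Probe:
    # Hypothetical observations gathered from the remote end.
    host_reachable: bool          # did the caravan's computer answer at all?
    data_age_s: Optional[float]   # seconds since the newest reading, None if unknown


def diagnose(probe: Probe, max_data_age_s: float = 600.0) -> Diagnosis:
    """Classify an outage from what is visible remotely.

    The 600-second threshold is an assumed value, not one from the story.
    """
    if not probe.host_reachable:
        # From the remote end, a dead link and a dead power line look identical.
        return Diagnosis.LINK_OR_POWER_DOWN
    if probe.data_age_s is None or probe.data_age_s > max_data_age_s:
        # The host answers but no fresh data is arriving.
        return Diagnosis.INSTRUMENT_FAULT
    return Diagnosis.HEALTHY
```

Note what the sketch cannot do: with no answer from the host, a remote check has no way to separate a driver crash from a power cut, which is exactly why someone kept having to drive out to the caravan.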
Fed up with the constant connection issues, Graham finally bought a shiny new router so that the PCMCIA card and its unreliable driver could be removed from the equation altogether and so that, in the event of a connection problem, they would only have to reboot the router. And, miraculously, the curse was broken. Weeks went by with 99% reliability, data was gathered, and the field support crew didn't hear a peep from Graham.
Re-Cursed
Graham had identified all of the obvious problems that can harm a software system – internet connection issues, cars running into things, floods – now it was time for the un-obvious problems.
The hardware and the connection were known to work fine, and he thought the software was similarly reliable, but now he was second-guessing himself. He'd built the software over the previous six years, tested it thoroughly in simulations and in the field at other sites, and he'd never seen the odd errors it was producing now. Worse, he couldn't even come up with scenarios that would reproduce these errors. He needed someone up in Mackay to take a look.
The field scientist took a cell phone with him and kept Graham on the line as he drove up to the caravan. The scientist was clearly annoyed, and with good reason. This last-minute project had required him to take many trips back and forth to the caravan, usually just to hit a reset button on something. Graham alternated apologies and thank-yous to the scientist, hearing his footsteps while he walked up to the caravan. As the scientist approached the system, he doubled over in laughter, setting the phone down on the ground.
"Looks like you've got a volunteer researcher helping you on the project," the scientist chortled, holding the phone up to the interloping amateur researcher. Aaa-rk. "There's a tree frog in here, sitting on the keyboard."
The red-eyed tree frog and perhaps some of his froggy friends had found their way into the cool, shady caravan, and turned it into their new haunt, occasionally hopping from one key to another, but usually just hanging out on one key. Graham nicknamed the frog "Y" because this frog was particularly fond of typing "yyyyyyyyyyyyyyyyyy."
Once the amphibian de-bugger had been debugged by installing some mesh around the gear, the system finally returned to normal. Or rather, not normal, because this was the first time that this installation actually worked reliably. And since then, there have been no major issues.
If there's a lesson to be learned here, it's to build your software to work even in cases where the universe hates you and your project. Always handle error conditions like connection issues, parasites, car accidents, frogs, floods, volcanoes, apocalypses – and that's a tip you won't even find in Code Complete.
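For the frog case specifically, one defensive measure is to refuse console input that looks like a held-down key. The following is only a sketch under assumptions: the function names and the run-length limit of 5 are made up for illustration, and nothing in the story says Graham's software did anything like this (the actual fix was mesh).

```python
from itertools import groupby


def longest_run(text: str) -> int:
    """Length of the longest run of one repeated character in text."""
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def looks_like_stuck_key(text: str, max_run: int = 5) -> bool:
    """Flag input whose longest repeated-character run exceeds max_run.

    max_run=5 is an assumed limit; real commands rarely repeat a key that often.
    """
    return longest_run(text) > max_run
```

A caller could discard or log any line for which `looks_like_stuck_key` returns true instead of passing it to the instrument software, for example the frog's signature "yyyyyyyyyyyyyyyyyy".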