If you really think about it, the fact that anything on a computer works is amazing. At a low level, magnets read and write ones and zeros on ridiculously fast rotating platters; those bits are assembled into files, which are loaded into memory, then passed through a video card and converted into some format that can be displayed on a screen. Throw in networked computers and signal loss over long distances, and the probability that something at some point in the process will fail increases exponentially. Maybe I'm alone, but I'm in awe of the fact that my computer doesn't just randomly catch fire and explode.
Of course, we can find and predict many errors, and even alert users that their software or hardware has failed (as long as either the monitor or a speaker is still working). Without any indication of where an error is happening, though, a failure gets much harder to diagnose. Russ recently got to witness a complex issue being diagnosed and resolved firsthand.
Russ works at a mid-sized data processing company that services financial institutions. The company was working to set up internet access to credit reports for its data processing software at a client's office. The hardware guys installed all of the routers and such, then the network guys set up a VPN so they could get into the client's system. But there was a problem.
Occasionally, the connection would drop, and there would be an outage for several minutes. Then, like clockwork, the connection would come back online for a few minutes. Then it'd die again. And so on. The network guys were trying to configure the system, but had to start over each time the connection dropped.
After a half hour of this, the network guys gave up on the configuration and tried to get to the bottom of the issue. Calls to ISP support yielded no leads as everything appeared fine on their end. The network team managed to confirm that the client lost all network and internet access each time it went down. Staff at the client site mentioned that right around the time the network went down, they'd hear a loud, constant buzz on the phone line. The buzzing would stop when the network came back online.
Meanwhile, the hardware team was in the server closet. As soon as they arrived, the source of the outages became apparent. See, along with the soothing, cool air generated by the air conditioning system came the horrible death of switches and other network equipment in the server closet. The air conditioners were on the same circuit as the networking equipment in the server closet. And when I say air conditioners, I don't mean AC to keep servers cool, but the AC to keep employees cool. So, fortunately, this wasn't really a problem during the winter.
When the problem was discovered, the network team took decisive action. Rather than suggesting the client change the circuit that the AC was on, they sent an email to the employees responsible for the client's account:
Credit Report access is installed and tested. NOTE that this circuit only works when the air conditioning at the client's site is off. When the AC is on, the circuit is dead and credit reports will not work.
Russ couldn't confirm the outcome of this story; our best guess is that a software fix was deployed, alerting users that there would be outages every few minutes in late spring, summer, and early fall.