Sorry everyone, I don't have a story of massive failure to share with you today. Instead I thought it'd be fun to share this simple story of success -- or at least, success as defined by a vendor at Jamie's company.
A few months back, Jamie's company ran into a Serious Production Issue with their system. I'm sure you're all familiar with the SPI routine: the overhead lights dim; metal sheaths slam down, covering the windows; red SPI lights are activated; an ominous voice repeats "serious production issue" over the PA system; and a special forces team is assembled to solve the problem. At least, that's my company's procedure. Well, with the exception of the PA voice. We hired Majel Barrett (voice of Star Trek's onboard computers) for all of our automated announcements.
The issue with their system lay in its external interface: the Global Logistics Module was erroring out. The failure -- I mean, success -- was caused by a single web service called "Traffic Update." Naturally, no one had any idea what the web service was used for. Not even the vendor did.
To add to the mystery, the Global Logistics Module worked without issue in QA, Test, and Development. Each of those environments reported that the "Traffic Update" service ran successfully, and all of them used the same URL that production used. Jamie, one of the network administrators, logged on to the production server and tried to access the URL ...
$ wget https://edi.initech.com/ext/globlgst/traffic_update
--11:17:04-- https://edi.initech.com/ext/globlgst/traffic_update
           => ext/globlgst/traffic_update
Connecting to edi.initech.com:443 ... connected!
HTTP request sent, awaiting response ... 403 Not Allowed
11:17:04 ERROR 403: Unknown Service
The service didn't exist. They let the vendor know their findings and shortly thereafter received the following email response:
We've changed the configuration of the production Global Logistics interface to no longer accept HTTP 403 and other failures as success. Your test installations still accept 403. Because you will likely experience several other errors from the service when it fails, we've changed the configuration back.
That explained why they never really had problems with the service before. And fortunately, they won't be experiencing any failures any time soon; only success.
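For the curious, here is a minimal sketch of the difference between the two configurations the vendor describes. Everything in it is an assumption made for illustration: the function names, the ACCEPTED_FAILURES set, and the idea that the setting boils down to a status-code check are all hypothetical, and nothing here comes from the vendor's actual product.

```python
# Hypothetical sketch only: these names are invented for illustration and
# are not taken from the Global Logistics Module or its configuration.

# Assumed "lenient" setting: failure codes that get reported as success.
# The email mentions 403 "and other failures"; only 403 is listed here.
ACCEPTED_FAILURES = {403}


def is_success_lenient(status: int) -> bool:
    """Treat 2xx responses, plus any whitelisted failure code, as success."""
    return 200 <= status < 300 or status in ACCEPTED_FAILURES


def is_success_strict(status: int) -> bool:
    """Treat only 2xx responses as success."""
    return 200 <= status < 300


if __name__ == "__main__":
    for status in (200, 403, 500):
        print(status, is_success_lenient(status), is_success_strict(status))
```

Under the lenient check, the 403 that Jamie saw counts as a successful run; under the strict check, it is the Serious Production Issue it always was.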