It's not always "fun" bugs and flaws. Earlier this year, we did a deep dive on a much more serious example of what can go wrong.
A few months ago, someone noted in the comments that they hadn't heard about the Therac-25 incident. I was surprised, and went off to do an informal survey of developers I know, only to discover that only about half of them knew what it was without searching for it.
I think it's important that everyone in our industry know about this incident, and upon digging into the details I was stunned by how much of a WTF there was.
Today's article is not fun, or funny. It describes incidents of death and maiming caused by faulty software engineering processes. If that's not what you want today, grab a random article from our archive, instead.
When you're strapping a patient to an electron gun capable of delivering a 25MeV particle beam, following procedure is vitally important. The technician operating the Therac-25 radiotherapy machine at the East Texas Cancer Center (ETCC) had been running this machine, and those like it, long enough that she had the routine down.
On March 21, 1986, the technician brought a patient into the treatment room. She checked their prescription, and positioned them onto the bed of the Therac-25. Above the patient was the end-point of the emitter, a turntable which allowed her to select what kind of beam the device would emit. First, she set the turntable to a simple optical laser mode, and used that to position the patient so that the beam struck a small section of his upper back, just to one side of his spine.
Image credit: Ajzh2074, own work, CC BY-SA 4.0.
With the patient in the correct position, she rotated the turntable again. There were two other positions. One would position an array of magnets between the beam and the patient; these would shape and aim the beam. The other placed a block of metal between the beam and the patient. When struck by a 25MeV beam of electrons, the metal would radiate X-rays.
This patient's prescription was for an electron beam, so she positioned the turntable and left the room. In the room next door, shielded from the radiation, was the control terminal. The technician started keying in the prescription to begin the treatment.
If things were exactly following the routine, she'd be able to communicate with the patient via an intercom, and monitor the patient via a video camera. Sadly, that system had broken down today. Still, this patient had already had a number of treatments and knew what to expect, so that communication was hardly necessary. In fact, the Therac-25 and all the supporting equipment were always finicky, so "something doesn't work" was practically part of the routine.
The technician had run this process so many times that keying in the prescription was almost automatic. She'd become an extremely fast typist, at least on this device, and perhaps too fast. In the field for beam type, she accidentally keyed in "X", for "x-ray". It was a natural mistake, as most patients got x-ray treatments, and it wasn't much of a problem: the computer would see that the turntable was in the wrong position and refuse to dose the patient. She quickly tapped the "UP" arrow on the keyboard to return to the field, corrected the value to "E", for electron, and confirmed the other parameters.
Her finger hovered over the "B" key on the keyboard while she confirmed her data entry. Once she was sure everything was correct, she pressed "B" for "beam start". There was no noise, there never was, but after a moment, the terminal read: "Malfunction 54", and then "Treatment Pause".
Error codes were no surprise. The technicians kept a chart next to the console, which documented all the error codes. In this case, "Malfunction 54" meant a "dose input 2" error.
That may not have explained anything, but the technician was used to the error codes being cryptic. And this was a "treatment pause", which meant the next step was to resume treatment. According to the terminal, no radiation had been delivered yet, so she hit the "P" key to unpause the beam.
That's when she heard the screaming.
The patient had been through a number of these sessions already, and knew he shouldn't feel a thing. The first time the technician activated the beam, however, he felt a burning sensation, which he later described as feeling like "hot coffee" being poured on his back. Without any intercom to call for help, he started to get off the treatment table. He was still extricating himself, screaming for help, when the technician unpaused the beam, at which point he felt something like a massive electric shock.
That, at first, was the diagnosis. A malfunction in the machine must have delivered an electric shock. The patient was sent home, and the hospital physicist examined the Therac-25, confirming everything was in working order and there were no signs of trouble. It didn't seem like it would happen again.
The patient had been prescribed a dose of 180 rads as part of a six-week treatment program that would deliver 6,000 rads in total. According to the Therac-25, the patient had received an underdose, a fraction of that radiation. In reality, the malfunction had delivered between 16,000 and 25,000 rads. The patient seemed fine, but they had already received a fatal dose, and no one knew it yet.
The ETCC incident was not the first, and sadly was not the last malfunction of the Therac-25 system. Between June 1985 and July 1987, there were six accidents involving the Therac-25, manufactured by Atomic Energy of Canada Limited (AECL). Each was a severe radiation overdose, which resulted in serious injuries, maimings, and deaths.
As the first incidents started to appear, no one was entirely certain what was happening. Radiation poisoning is hard to diagnose, especially if you don't expect it. As with the ETCC incident, the machine reported an underdose despite overdosing the patient. Hospital physicists even contacted AECL when they suspected an overdose, only to be told such a thing was impossible.
A few weeks later, there was a second overdose at ETCC, and it was around that time that the FDA and the press started to get involved. Early on, there was a great deal of speculation about the cause. Of interest is this comment from the RISKS mailing list from 1986.
Here is my speculation of what happened: I suspect that the current in the electron beam is probably much greater in X-ray mode (because you want similar dose rates in both modes, and the production of X-rays is more indirect). So when you select X-rays, I'll bet the target drops into place and the beam current is boosted. I suspect in this case, the current was boosted before the target could move into position, and a very high current electron beam went into the patient.
How could this be allowed to happen? My guess is that the software people would not have considered it necessary to guard against this failure mode. Machine designers have traditionally used electromechanical interlocks to ensure safety. Computer control of therapy machines is a fairly recent development and is layered on top of, rather than substituting for, the old electromechanical mechanisms.
The Therac-25 was the first entirely software-controlled radiotherapy device. As that quote from Jacky above points out, most such systems use hardware interlocks to prevent the beam from firing when the targets are not properly configured. The Therac-25 did not.
The software included a number of key modules that ran on a PDP-11. First, there were separate processes for handling each key function of the system: user input, beam alignment, dosage tracking, etc. Each of these processes was implemented in PDP-11 Assembly. Governing these processes was a real-time OS, also implemented in Assembly. All of this software, from the individual processes to the OS itself, was the work of a single software developer.
AECL had every confidence in this software, though, because it wasn't new. The earliest versions of the software appeared on the Therac-6. Development started in 1972, and the software was adapted to the Therac-25 in 1976. The same core was also used on the Therac-20. Within AECL, the attitude was that the software must be safe because they'd been using it for so long.
In fact, when AECL performed their own internal safety analysis of the Therac-25 in 1983, they did so with the following assumptions:
1) Programming errors have been reduced by extensive testing on a hardware simulator, and under field conditions on teletherapy units. Any residual software errors are not included in the analysis.
2) Program software does not decay due to wear, fatigue, or reproduction errors.
3) Computer software errors are caused by faulty hardware components, and "soft" (random) errors induced by alpha particles or electromagnetic noise.
In other words: we've used the software for a long time and software always copies and deploys perfectly. So, any bugs we see would have to be transient bugs caused by radiation or hardware errors.
After the second incident at ETCC, the hospital physicist took the Therac-25 out of service and worked with the technician to replicate the steps that caused the overdose. It wasn't easy to trigger the "Malfunction 54" error message, especially when they were trying to methodically replicate the exact steps, because as it turned out, if you entered the data slowly, there were no problems.
To trigger the overdose, you needed to type quickly, the kind of speed that an experienced operator might have. The physicist practiced until he could replicate the error, then informed AECL. While he was taking measurements to see how large the overdoses were, AECL called back. They couldn't replicate the issue. "It works on my machine," essentially.
After being coached on the required speed, the AECL technicians went back to it, and confirmed that they could trigger an overdose. When the hospital physicist took measurements, they found roughly 4,000 rads in the overdose. AECL, doing similar tests, triggered overdoses of 25,000 rads. The reality is that, depending on the timing, the output was potentially random.
With that information, the root cause was easier to understand: there was a race condition. Specifically, when the technician mistyped "X" for x-ray, the computer would calculate out the beam activation sequence to deliver a high-energy beam to create x-rays. When the technician hit the "UP" arrow to correct their mistake, it should've forced a recalculation of that activation sequence—but if the user typed too quickly, the UI would update and the recalculation would never happen.
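The shape of that race can be sketched in a few lines of Python. This is a toy model, not the Therac-25's actual PDP-11 code: the `Console` class, the tick counter, and `SETUP_TICKS` are all assumptions made up for illustration. It only shows how an edit that lands while setup is still in progress can change the screen without changing what the hardware was configured for.

```python
from dataclasses import dataclass

SETUP_TICKS = 8  # assumed duration of the hypothetical setup task


@dataclass
class Console:
    displayed_mode: str = ""   # what the operator sees on screen
    configured_mode: str = ""  # what the beam sequence was calculated for
    setup_remaining: int = 0   # ticks left in the current setup pass

    def enter_mode(self, mode: str) -> None:
        """Operator types a beam mode; the screen always updates."""
        self.displayed_mode = mode
        if self.setup_remaining == 0:
            # Idle: start calculating the activation sequence for this mode.
            self.configured_mode = mode
            self.setup_remaining = SETUP_TICKS
        # BUG (deliberate): an edit that arrives mid-setup is never recalculated.

    def tick(self, n: int = 1) -> None:
        """Advance the simulated clock by n ticks."""
        self.setup_remaining = max(0, self.setup_remaining - n)

    def safe_to_fire(self) -> bool:
        """Hypothetical final cross-check: screen and configuration must agree."""
        return self.setup_remaining == 0 and self.displayed_mode == self.configured_mode


def session(delay_before_correction: int) -> Console:
    """Mistype "X", wait, correct to "E", then let setup finish."""
    console = Console()
    console.enter_mode("X")
    console.tick(delay_before_correction)
    console.enter_mode("E")
    console.tick(SETUP_TICKS)
    return console


slow = session(SETUP_TICKS)  # careful operator: correction lands after setup
fast = session(1)            # fast typist: correction lands mid-setup
print(slow.displayed_mode, slow.configured_mode, slow.safe_to_fire())  # E E True
print(fast.displayed_mode, fast.configured_mode, fast.safe_to_fire())  # E X False
```

In the fast session the screen says "E" while the configuration still says "X", which is the mismatch the operator could not see. A final cross-check like the hypothetical `safe_to_fire` is a software stand-in for the kind of interlock that refuses to proceed when the two disagree.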
By the middle of 1986, the Food and Drug Administration (FDA) was involved, and demanded that AECL provide a Corrective Action Plan (CAP). What followed was a lengthy process of revisions as AECL would provide their CAP and the FDA would follow up with questions, resulting in new revisions to the CAP.
For example, the FDA reviewed the first CAP revision and noted that it was incomplete. Specifically, it did not include a test plan. AECL responded:
no single test plan and report exists for the software since both hardware and software were tested and exercised separately and together for many years.
The FDA was not pleased with that, and after more back and forth, replied:
We also expressed our concern that you did not intend to perform the [test] protocol to future modifications to the software. We believe that rigorous testing must be performed each time a modification is made to ensure the modification does not adversely affect the safety of the system.
While AECL struggled to include complex tasks like testing in their CAP, they had released instructions that allowed for a temporary fix to prevent future incidents. Unfortunately, in January, 1987, there was another incident, caused by a different software bug.
In this bug, there was a variable shared by multiple processes, meant as a flag to decide whether or not the beam collimator in the turntable needed to be checked to ensure everything was in the correct position. If the value was non-zero, the check needed to be performed. If the value was zero, it did not. Unfortunately, the software would increment the field rather than set it, and the field was only one byte wide. This meant that on every 256th increment the value wrapped around to zero when it should have been non-zero. If that incorrect zero lined up with an operator action, the beam would fire at full energy without the turntable in the right position.
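The wraparound is easy to reproduce. Here is a minimal Python sketch, assuming a flag that starts at zero and is updated once per pass; the function names are hypothetical, and the `& 0xFF` mask stands in for storage that is one byte wide. It compares incrementing the flag against simply assigning a non-zero constant.

```python
from typing import Callable, List


def increment_byte(flag: int) -> int:
    """Add 1 to a value held in a single byte; 255 wraps around to 0."""
    return (flag + 1) & 0xFF


def set_flag(flag: int) -> int:
    """Assign a fixed non-zero value instead of incrementing."""
    return 1


def skipped_checks(update: Callable[[int], int], passes: int) -> List[int]:
    """Return the pass numbers on which the flag wrongly reads zero."""
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = update(flag)  # intended meaning: "the check is required"
        if flag == 0:        # zero is read as "no check needed"
            skipped.append(n)
    return skipped


print(skipped_checks(increment_byte, 1000))  # [256, 512, 768]
print(skipped_checks(set_flag, 1000))        # []
```

With the increment, the check silently disappears on passes 256, 512, and 768. With plain assignment, it never does.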
AECL had a fix for that (stop incrementing and just set the value), and amended their CAP to include that fix. The FDA recognized that was probably going to fix the problem, but still had concerns. In an internal memo:
We are in the position of saying that the proposed CAP can reasonably be expected to correct the deficiencies for which they were developed (Tyler). We cannot say that we are [reasonably] confident about the safety of the entire system…
This back-and-forth continued through a number of CAP revisions. At each step in the process, the FDA found issues with testing. AECL's test process up to this point was simply to run the machine and note if anything went wrong. Since the software had been in use, in some version, for over a decade, they did not see any reason to test the software, and thus had no capacity or plan for actually testing the software when the FDA required it.
The FDA, reviewing some test results, noted:
Amazingly, the test data presented to show that the software changes to handle the edit problems in the Therac-25 are appropriate prove the opposite result. … I can only assume the fix is not right, or the data were entered incorrectly.
Eventually, the software was fixed. Legislative and regulatory changes were made to ensure incidents like this couldn't happen in the future, at least not the same way.
It's worth noting that there was one developer who wrote all of this code. They left AECL in 1986, and thankfully for them, no one has ever revealed their identity. And while it may be tempting to lay the blame at their feet—they made every technical choice, they coded every bug—it would be wildly unfair to do that.
With AECL's continued failure to explain how to test their device, it should be clear that the problem was a systemic one. It doesn't matter how good your software developer is; software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs not only your software development practices, but also your testing. Your management. Even your sales and servicing.
While the incidents at the ETCC finally drove changes, they weren't the first incidents. Hospital physicists had already reported problems to AECL. At least one patient had already initiated a lawsuit. But that information didn't propagate through the organization; no one put those pieces together to recognize that the device was faulty.
On this site, we joke a lot at the expense of the Paula Beans and Roys of this world. But no matter how incompetent, no matter how reckless, no matter how ignorant the antagonist of a TDWTF article may be, they're part of a system, and that system put them in that position.
Failures in IT are rarely individual failures. They are process failures. They are systemic failures. They are organizational failures. The story of AECL and the Therac-25 illustrates how badly organizational failures can end up.
AECL did not have a software process. They didn't view software as anything more than a component of a larger whole. In that kind of environment, working on safety critical systems, no developer could have been entirely successful. Given that this was a situation where lives were literally on the line, building a system that produced safe, quality software seems like it should have been a priority. It wasn't.
While the Therac-25 incident is ancient history, software has become even more important. While we would hope safety-critical software has rigorous processes, we know that isn't always true. The 737MAX is an infamous, recent example. But with the importance of software in the modern world, even more trivial software problems can get multiplied at scale. Whether it's machine learning reinforcing racism, social networks turning into cesspools of disinformation or poorly secured IoT devices turning into botnets, our software exists and interacts with the world, and has real world consequences.
If nothing else, I hope this article makes you think about the process you use to create software. Is the process built to produce quality? What obstacles to quality are there? Is quality a priority, and if not, why not? Does your process consider quality at scale? You may know your software's failure modes, but do you understand your organization's failure modes? Its blind spots? The assumptions it makes which may not be valid in all cases?
Let's return for a moment to the race condition that caused the ETCC incidents. This was caused by users hitting the up arrow too quickly, preventing the system from properly registering their edits. While the FDA CAP process was grinding along, AECL wanted to ensure that people could still use the Therac-25 safely, and that meant publishing quick fixes that users could apply to their devices.
This is the letter AECL sent out to address that bug:
SUBJECT: CHANGE IN OPERATING PROCEDURES FOR THE THERAC-25 LINEAR ACCELERATOR
Effective immediately, and until further notice, the key used for moving the cursor back through the prescription sequence (i.e., cursor "UP" inscribed with an upward pointing arrow) must not be used for editing or any other purpose.
To avoid accidental use of this key, the key cap must be removed and the switch contacts fixed in the open position with electrical tape or other insulating material.
For assistance with the latter you should contact your local AECL service representative.
Disabling this key means that if any prescription data entered is incorrect, then "R" reset command must be used and the whole prescription reentered.
For those users of the Multiport option, it also means that editing of dose rate, dose, and time will not be possible between ports.
On one hand, this is a simple instruction that would effectively prevent the ETCC incidents from reoccurring. On the other, it's terrifying to imagine a patient's life hanging on a ripped up keycap and electrical tape.
This article is intended as a brief summary of the incident. Most of the technical details in this article come from this detailed account of the Therac-25 incident. That is the definitive source on the subject, and I recommend reading the whole thing. It contains much more detail, including deeper dives into the software choices and organizational failures.