It’s been a rough couple weeks. Not only did I have all sorts of catching-up to do after Code PaLOUsa, but it also happened to be release week. And oh, do I hate release week.

Don’t get me wrong, I’m just as excited as anyone else when a new release of BuildMaster comes out (it was release 2.3, in case you were wondering), but new releases mean testing. And fixes. And more testing. And still more testing. And oh, do I hate testing.

As I worked my way through the drudgery that was release week, I spent a lot of time thinking about testing. How did I get stuck on the test team? Why didn’t I call in sick today? Can’t we get someone else to do this? There were even a few times I asked the fundamental question: what is the whole point of testing software in the first place? While I never found answers to most of those questions, the last one has a very simple answer.

Testing is performed to reduce the risk of introducing defects in production.

And that’s it. Of course, you can do other things while testing (such as Test-Driven Development or Training through Testing), but then again you can overload just about any activity. But when the primary purpose of an overloaded activity is no longer necessary (commuting sixty miles, for example), the secondary purposes are often achieved in a more efficient manner: you don’t need to drive a car in order to listen to the radio.

Types of Testing

In my career, I’ve heard of dozens of different types of tests and testing techniques, but when you look at things as a whole, it’s relatively simple. There are exactly five categories of tests that can be performed on software, and they’re generally performed in a sequential order.

  1. Integration Testing – formal or informal testing that is generally performed by the developer(s) responsible for the changes; this serves as a verification that the changes are integrated into the larger application and are ready for functional testing
  2. Functional Testing – formal test scripts (i.e. documents containing test cases, or step-by-step guides to verify functional requirements) are executed by testers
  3. Acceptance Testing – formal or informal testing to ensure that functional requirements as implemented are valid and meet the business need
  4. Quality Testing – formal or informal testing to ensure that non-functional requirements (regulatory, performance, etc.) are met
  5. Staging Testing – verification that the software can be deployed to an environment that matches the production environment

Every type of test fits into one or more of these categories. Automated unit tests, for example, are generally considered to be a type of Integration Testing, as they test inter- and intra-component integration. Guerilla testing – i.e. clicking on a bunch of things in no particular order hoping to find something that’s broken – is generally a form of acceptance testing, but it’s chaotic enough that you could count it as integration or quality testing, too. But regardless of the category, testing as a whole is performed to give a “good enough” answer to the following questions:

  1. Does the program function?
  2. Does the program functionality meet the requirements?
  3. Do the requirements meet the need?
  4. Does the program meet quality standards?
  5. Can the program be deployed?

I say “good enough” because no matter how hard you try, a definitive answer is impossible. At best (i.e., with unlimited resources), you can be 99.999…% confident that there will be no defects in production.

The reason that no amount of testing can provide 100% accuracy goes back to a fundamental problem famously phrased by the Roman poet Juvenal: quis custodiet ipsos custodes?, or who will guard the guardians? The tests themselves can be flawed and allow otherwise detectable defects to go to production. While one could certainly test their tests, those test tests would face a similar problem. As would the test test tests. And the test test test tests, ad infinitum.

Like many things that converge on perfection, there are significantly increasing costs as you approach 100%. A five-minute smoke test may only provide 40% certainty, but it may cost five hours of testing to achieve 60%, and fifty hours to achieve 80%.
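To make the arithmetic concrete, here is a small sketch that works out how many hours each additional point of confidence costs. It uses only the three illustrative figures from the paragraph above, which are examples rather than measurements, and the names `checkpoints` and `marginal_cost` are made up for this illustration.

```python
# Illustrative (hours spent, confidence %) pairs taken from the paragraph above.
# These are example figures, not measurements.
checkpoints = [(5 / 60, 40), (5, 60), (50, 80)]


def marginal_cost(points):
    """Return hours per extra percentage point between consecutive checkpoints."""
    costs = []
    for (hours_before, pct_before), (hours_after, pct_after) in zip(points, points[1:]):
        costs.append((hours_after - hours_before) / (pct_after - pct_before))
    return costs


costs = marginal_cost(checkpoints)
for (_, pct_before), (_, pct_after), cost in zip(checkpoints, checkpoints[1:], costs):
    print(f"{pct_before}% -> {pct_after}%: {cost:.2f} hours per point")
```

With these figures, each point between 60% and 80% costs roughly nine times as much as each point between 40% and 60%.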

An Inherent Risk

Because no amount of testing can prevent all defects, there is always a risk to making changes. You might not think that simply changing the label next to a text field could cause anything to go wrong, but it has happened before (I’ve seen it first hand), and it will happen again. It doesn’t matter what the source of the defect is (code, deployment, configuration, etc.), the fact is that the defect was introduced as an end result of a change.

The only way to completely avoid the inherent risk of change is to avoid change altogether, but that’s as feasible an option as never leaving the house to avoid getting hit by a bus. As important as it is to reduce the risk of defects through testing, it’s equally important to consider the remaining, “untestable” risk.

  1. Change Impact – the estimated scope of a given change; this varies from change to change and, like testing, is always a “good enough” estimate, as even a seemingly simple change could bring down the entire system
  2. Severity of Defect – the impact of a defect on the overall system; this is a “constant” for a given system, as there is no way of knowing how severe a defect might be

The risk of change, therefore, is a function of three factors:

{Change Impact} x {Severity of Defect} / {Thoroughness of Testing}

We’re generally pretty good at balancing these three factors, at least when it comes to computer-related changes. While Network Operations will generally just implement a DNS change without reproducing an entire network infrastructure just to make sure that the change won’t cause any problems, I doubt you would bat an eye if the Mars Rover team tested commands before sending them using a replica Mars Rover sitting on a pile of replica Mars rocks.
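As a rough sketch, the formula can be written as a function and applied to the two examples. The function name `change_risk`, the 1-to-10 scale, and every score below are assumptions invented for illustration; only the comparison between results means anything.

```python
def change_risk(change_impact, defect_severity, testing_thoroughness):
    """Relative risk of a change: impact x severity / thoroughness.

    All three inputs are unitless scores on an arbitrary scale (1-10 here).
    """
    if testing_thoroughness <= 0:
        raise ValueError("testing_thoroughness must be positive")
    return change_impact * defect_severity / testing_thoroughness


# Hypothetical scores, chosen only to illustrate the two examples in the text.
dns_change = change_risk(change_impact=2, defect_severity=3, testing_thoroughness=1)
rover_untested = change_risk(change_impact=2, defect_severity=10, testing_thoroughness=1)
rover_replica = change_risk(change_impact=2, defect_severity=10, testing_thoroughness=10)

print(dns_change, rover_untested, rover_replica)  # 6.0 20.0 2.0
```

Under these made-up scores, the heavily tested rover command ends up less risky than the barely tested DNS change, even though a rover defect is far more severe.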

But oftentimes, our risk management of software-related changes is a little out of balance.

The Weakest Link

Every now and then, I’ll talk to a developer who will proudly proclaim, “we’ve finally achieved 100% code coverage!”

For those unaware, that metric refers to the fact that every single line of code in a codebase will be executed by an automated unit test. It’s the Diebold XL2400 Bank Vault Door of unit testing, complete with 16” thick stainless steel cladding and a time-sensitive lock. And like any impenetrable entryway, it’s only as secure as its weakest link. Installing one next to a paned window would render it entirely useless.

The same rule applies to those iron-clad code coverage metrics. Who cares if there’s 100% code coverage when a unit test has a defect in it? Or if the requirements were misunderstood by the developer? Or if the requirements were wrong? Or if it’s not PCI compliant? Or if it breaks when it gets deployed to production?

It doesn’t matter how comprehensive your unit tests are if your functional, acceptance, quality, and staging tests are inadequate. Defects will simply slip through the most un-tested part.
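Here is a tiny sketch of that weakest link. The `apply_discount` function and its requirement (orders of 100.00 or more get 10% off) are hypothetical. The test executes every line and both branches of the function, so a coverage tool would report 100%, yet the defect survives because the test never checks the boundary.

```python
def apply_discount(total_cents):
    """Hypothetical requirement: orders of 100.00 or more get 10% off."""
    percent_off = 10 if total_cents > 10000 else 0  # defect: should be >= 10000
    return total_cents - total_cents * percent_off // 100


def test_apply_discount():
    # Exercises every line and both branches of apply_discount.
    assert apply_discount(20000) == 18000
    assert apply_discount(5000) == 5000


test_apply_discount()          # passes
print(apply_discount(10000))   # prints 10000; the requirement says 9000
```

Nothing in the coverage report hints at the problem. It takes a functional test written against the requirement to catch it.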

When I explain all of this to that enthusiastic developer, the response is sometimes along the lines of, “but that’s not my job, so who cares?”

That’s an unfortunate attitude to have. While it’s true that we, as programmers, are paid primarily to write code that conforms to requirements, the reason that we’re being paid is so that the organization can have adequate software. Not caring about the end result reminds me of that old contractor joke.

The foundation guy notices a problem with the plans, but says that the framer will fix it. The framer says that the drywaller will fix it, the drywaller says the finish carpenter will fix it, the finish carpenter says the painter will fix it, and the painter says “I sure hope the homeowner is blind and doesn’t see it.”

A true craftsman is passionate not only about the quality of his work, but about the quality of the entire project.

Defects Are Not Necessarily Problems

You may have noticed that I’ve used terms like “good enough” and “adequate” to describe the quality that we should strive for instead of words like “high” and “utmost.” The key difference is that “adequate” is a variable quality level that can be anything from “below average” to “above average”, whereas “high” generally refers to well above average.

I understand that striving for “adequacy” may seem hypocritical for someone who has so frequently lambasted low quality software, but allow me to explain. Actually, allow my refrigerator to explain.

A little more than a year ago, I was in the market for kitchen appliances and had a pretty good idea of what I could get with my budget. It wasn’t a whole lot, but then again, neither was my budget. And then I stumbled across this LG Side-by-Side. It was a 26.5 cubic foot fridge with contoured doors, hidden hinges, an in-door icemaker, and several other features that I couldn’t afford, but it had a deeply discounted price tag that brought it within my budget. The sides of the unit told why: it was as if Wolverine himself had unloaded it off the truck.

The unsightly gashes along the sides of the fridge were clearly a defect introduced during shipping, but it wasn’t a problem for me. In fact, it was a welcome defect, and I wish that Wolverine had been assigned to unload my microwave, stove, and dishwasher. I would have been able to get more features at the cost of defects that were not problems to me.

Quality or Quantity

I realize that there are many differences between software and refrigerators, but the variable nature of quality is similar. In many cases, introducing defects through change just isn’t that big of a problem.

Sometimes, it just makes sense to pay for quantity (more features) instead of quality (more testing). Practically speaking, that means spending your time writing new features instead of building unit tests, or vice versa. Either way, it’s not really our decision to make, since we’re not the ones paying our salaries.

Ultimately, the decision of quality over quantity should rest with the individual or organization that is paying for the software. Obviously, it’s our obligation as professionals to not only educate these decision makers about the risks of defects, but to also provide recommendations to help facilitate their decision.

This can sometimes be difficult, especially since many of us would love nothing more than to build Xanadu. But just as it would be negligent to not recommend a comprehensive test plan for the software that powers an MRI machine, it would be equally negligent to recommend that same level of testing for your church’s congregation database.

It all goes back to assessing the “Severity of Defects”. In this example, it’s not really a problem if Father Cronin needs to type in “<br />” for a line break (it probably isn’t even worth your bill rate to fix that defect), but it certainly is a problem if an integer overflow might cause the pressure monitoring system to crash, which in turn might cause an MRI machine explosion.

Testing Done Right

No amount of testing can completely eliminate the risk of introducing defects. The harder you try, the more costly it becomes, and there comes a point where the cost of insuring against a risk is no longer worth the premium. Therefore, Testing Done Right is an exercise in reducing the risk of change to an acceptable level.

There are no hard and fast rules for determining what the “acceptable” level of risk is, but the factors to consider are the frequency of changes, the impact of those changes, and the severity of defects. But keep in mind how they relate to each other. For example, if an application will likely only change every year or so, investing upfront in automated testing may be:

  • Wasteful, assuming the application is relatively easy to maintain and a defect would cause, at worst, a few hours of inconvenience a month
  • Valuable, if the application is complex and a defect might cause a stoppage at a manufacturing facility

Just as important as assessing the risk is mitigating it. Remember, defects will simply slip through the most un-tested part, so a balanced testing plan is critical.

  1. Integration Testing – does the program function?
  2. Functional Testing – does the program functionality meet the requirements?
  3. Acceptance Testing – do the requirements meet the need?
  4. Quality Testing – does the program meet quality standards?
  5. Staging Testing – can the program be deployed?

It’s a complete waste of resources to develop an application with 100% unit test coverage but limited functional, acceptance, and staging testing. Of course, the absolute best way to reduce the risk of defects in a system is to minimize the codebase and to keep things as simple as possible, thereby reducing the number of components and the overall complexity. But that’s a whole different soapbox.