| « TAG++ | Gary Strikes Again » |
It’s been a rough couple weeks. Not only did I have all sorts of catching-up to do after Code PaLOUsa, but it also happened to be release week. And oh, do I hate release week.
Don’t get me wrong, I’m just as excited as anyone else when a new release of BuildMaster comes out (it was release 2.3, in case you were wondering), but new releases mean testing. And fixes. And more testing. And still more testing. And oh, do I hate testing.
As I worked my way through the drudgery that was release week, I spent a lot of time thinking about testing. How did I get stuck on the test team? Why didn’t I call in sick today? Can’t we get someone else to do this? There were even a few times I asked the fundamental question: what is the whole point of testing software in the first place? While I never found answers to most of those questions, the last one has a very simple answer.
Testing is performed to reduce the risk of introducing defects in production.
And that’s it. Of course, you can do other things while testing (such as Test-Driven Development or Training through Testing), but then again you can overload just about any activity. But when the primary purpose of an overloaded activity is no longer necessary (commuting sixty miles, for example), the secondary purposes are often achieved in a more efficient manner: you don’t need to drive a car in order to listen to the radio.
In my career, I’ve heard of dozens of different types of tests and testing techniques, but when you look at things as a whole, it’s relatively simple. There are exactly five categories of tests that can be performed on software, and they’re generally performed in a sequential order.
Every type of test fits into one or more of these categories. Automated unit tests, for example, are generally considered to be a type of Integration Testing, as they test inter- and intra-component integration. Guerilla testing – i.e. clicking on a bunch of things in no particular order hoping to find something that’s broken – is generally a form of acceptance testing, but it’s chaotic enough that you could count it as integration or quality testing, too. But regardless of the category, testing as a whole is performed to give a “good enough” answer to the following questions:
I say “good enough” because no matter how hard you try, a definitive answer is impossible. At best (i.e., with unlimited resources), you can be 99.999…% confident that there will be no defects in production.
The reason that no amount of testing can provide 100% certainty goes back to a fundamental problem usually traced to Plato and famously phrased by Juvenal: quis custodiet ipsos custodes?, or who will guard the guardians? The tests themselves can be flawed and allow otherwise detectable defects to go to production. While one could certainly test their tests, those test tests would face a similar problem. As would the test test tests. And the test test test tests, ad infinitum.
Like many things that converge on perfection, there are significantly increasing costs as you approach 100%. A five-minute smoke test may only provide 40% certainty, but it may cost five hours of testing to achieve 60%, and fifty hours to achieve 80%.
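To make the diminishing returns concrete, here is a minimal sketch that works out the marginal cost between the illustrative figures above. The numbers are the made-up ones from the paragraph, not measurements, and the function name is my own.

```python
# (hours of testing, confidence %) pairs, taken from the illustrative figures above
tiers = [(5 / 60, 40), (5.0, 60), (50.0, 80)]

def marginal_cost(points):
    """Return hours spent per additional percentage point between consecutive tiers."""
    costs = []
    for (h0, c0), (h1, c1) in zip(points, points[1:]):
        costs.append((h1 - h0) / (c1 - c0))
    return costs

costs = marginal_cost(tiers)
for (_, c0), (_, c1), cost in zip(tiers, tiers[1:], costs):
    print(f"{c0}% -> {c1}%: {cost:.2f} hours per percentage point")
```

With these figures, each percentage point between 60% and 80% costs roughly nine times as much as a point between 40% and 60%.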
Because no amount of testing can prevent all defects, there is always a risk to making changes. You might not think that simply changing the label next to a text field could cause anything to go wrong, but it has happened before (I’ve seen it firsthand), and it will happen again. It doesn’t matter what the source of the defect is (code, deployment, configuration, etc.); the fact is that the defect was introduced as an end result of a change.
The only way to completely avoid the inherent risk of change is to avoid change altogether, but that’s as feasible an option as never leaving the house to avoid getting hit by a bus. As important as it is to reduce the risk of defects through testing, it’s equally important to consider the remaining, “untestable” risk.
The risk of change, therefore, is the function of three factors:
{Change Impact} x {Severity of Defect} / {Thoroughness of Testing}
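The relationship above is a rule of thumb rather than a precise metric, but a rough sketch shows how the three factors trade off. The 1-to-10 scoring scale, the function name, and the example scores are all assumptions for illustration.

```python
def change_risk(impact, severity, thoroughness):
    """Relative risk score: impact x severity / thoroughness.

    All three inputs are hypothetical scores on an arbitrary 1-10 scale;
    the result is only meaningful when comparing changes to each other.
    """
    if thoroughness <= 0:
        raise ValueError("thoroughness must be positive")
    return impact * severity / thoroughness

# A small, low-stakes change with a quick smoke test
label_change = change_risk(impact=1, severity=2, thoroughness=1)

# A large, high-stakes change with moderate testing
big_change = change_risk(impact=8, severity=9, thoroughness=4)

# Doubling the testing effort on the big change halves its score
big_change_more_tests = change_risk(impact=8, severity=9, thoroughness=8)

print(label_change, big_change, big_change_more_tests)
```

The scores only rank changes against one another: the big change is still riskier after doubling the testing than the label change was with almost none.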
We’re generally pretty good at balancing these three factors, at least when it comes to computer-related changes. While Network Operations will generally just implement a DNS change without reproducing an entire network infrastructure just to make sure that the change won’t cause any problems, I doubt you would bat an eye if the Mars Rover team tested commands before sending them using a replica Mars Rover sitting on a pile of replica Mars rocks.
But oftentimes, our risk management of software-related changes is a little out of balance.
Every now and then, I’ll talk to a developer that will proudly proclaim, “we’ve finally achieved 100% code coverage!”
For those unaware, that metric means that every single line of code in a codebase is executed by at least one automated unit test. It’s the Diebold XL2400 Bank Vault Door of unit testing, complete with 16” thick stainless steel cladding and a time-sensitive lock. And like any impenetrable entryway, it’s only as secure as its weakest link. Installing one next to a paned window would render it entirely useless.
The same rule applies to those iron-clad code coverage metrics. Who cares if there’s 100% code coverage when a unit test has a defect in it? Or if the requirements were misunderstood by the developer? Or if the requirements were wrong? Or if it’s not PCI compliant? Or if it breaks when it gets deployed to production?
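As a small illustration of how full coverage and a defect can coexist, consider the sketch below. The function and its test are hypothetical, invented only for this example.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def test_average():
    # This single test executes every line of average(): 100% line coverage.
    assert average([2, 4, 6]) == 4

test_average()  # passes

# The coverage report is spotless, yet an input nobody tested still blows up.
try:
    average([])
    crashed = False
except ZeroDivisionError:
    crashed = True

print("crashes on empty input:", crashed)
```

Coverage reports which lines ran, not which inputs or requirements were checked, so the paned window stays wide open.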
It doesn’t matter how comprehensive your unit tests are if your functional, acceptance, quality, and staging tests are inadequate. Defects will simply slip through the most un-tested part.
When I explain all of this to that enthusiastic developer, the response is sometimes along the lines of, “but that’s not my job, so who cares?”
That’s an unfortunate attitude to have. While it’s true that we, as programmers, are paid primarily to write code that conforms to requirements, the reason that we’re being paid is so that the organization can have adequate software. Not caring about the end result reminds me of that old contractor joke.
The foundation guy notices a problem with the plans, but says that the framer will fix it. The framer says that the drywaller will fix it, the drywaller says the finish carpenter will fix it, the finish carpenter says the painter will fix it, and the painter says “I sure hope the homeowner is blind and doesn’t see it.”
A true craftsman is passionate not only about the quality of his work, but about the quality of the entire project.
You may have noticed that I’ve used terms like “good enough” and “adequate” to describe the quality that we should strive for instead of words like “high” and “utmost.” The key difference is that “adequate” is a variable quality level that can be anything from “below average” to “above average”, whereas “high” generally refers to well above average.
I understand that striving for “adequacy” may seem hypocritical for someone who has so frequently lambasted low quality software, but allow me to explain. Actually, allow my refrigerator to explain.
A little more than a year ago, I was in the market for kitchen appliances and had a pretty good idea of what I could get with my budget. It wasn’t a whole lot, but then again, neither was my budget. And then I stumbled across this LG Side-by-Side. It was a 26.5 cubic foot fridge with contoured doors, hidden hinges, an in-door icemaker, and several other features that I couldn’t afford, but it had a deeply discounted price tag that brought it within my budget. The sides of the unit told why: it was as if Wolverine himself had unloaded it off the truck.
The unsightly gashes along the sides of the fridge were clearly a defect introduced during shipping, but it wasn’t a problem for me. In fact, it was a welcome defect, and I wish that Wolverine had been assigned to unload my microwave, stove, and dishwasher. I would have been able to get more features at the cost of defects that were not problems to me.
I realize that there are many differences between software and refrigerators, but the variable nature of quality is similar. In many cases, introducing defects through change just isn’t that big of a problem.
Sometimes, it just makes sense to pay for quantity (more features) instead of quality (more testing). Practically speaking, that means spending your time writing new features instead of building unit tests, or vice versa. Either way, it’s not really our decision to make, since we’re not the ones paying our salaries.
Ultimately, the decision of quality over quantity should rest with the individual or organization that is paying for the software. Obviously, it’s our obligation as professionals to not only educate these decision makers about the risks of defects, but to also provide recommendations to help facilitate their decision.
This can sometimes be difficult, especially since many of us would love nothing more than to build Xanadu. But just as it would be negligent to not recommend a comprehensive test plan for the software that powers an MRI machine, it would be equally negligent to recommend that same level of testing for your church’s congregation database.
It all goes back to assessing the “Severity of Defects”. In this example, it’s not really a problem if Father Cronin needs to type in “<br /> ” for a line break (it probably isn’t even worth your bill rate to fix that defect), but it certainly is a problem if an integer overflow might cause the pressure monitoring system to crash, which in turn might cause an MRI machine explosion.
No amount of testing can completely eliminate the risk of introducing defects. The harder you try, the more costly it becomes, and there comes a point where the cost of insuring against a risk is no longer worth the premium. Therefore, Testing Done Right is an exercise in reducing the risk of change to an acceptable level.
There are no hard and fast rules for determining what the “acceptable” level of risk is, but the factors to consider are the frequency of changes, the impact of those changes, and the severity of defects. But keep in mind how they relate to each other. For example, if an application will likely only change every year or so, investing upfront in automated testing may be:
Just as important as assessing the risk is mitigating it. Remember, defects will simply slip through the most un-tested part, so a balanced testing plan is critical.
It’s a complete waste of resources to develop an application with 100% unit test coverage but limited functional, acceptance, and staging testing. Of course, the absolute best way to reduce the risk of defects in a system is to minimize the codebase and to keep things as simple as possible, thereby reducing the number of components and the overall complexity. But that’s a whole different soapbox.
|
I agree: testing is designed to reach some level of quality that will never be 100%.
Where I often see a lack in software is what I would call recoverability: The ability to correct a problem after it has occurred. I'm generally a cautious individual. Experience has demonstrated that I should always have a fallback plan, which is really a chain of defenses thing. If the application has no recoverability, then there is no fallback: The only thing between "everything is good" and "absolute disaster" is a fence called "everything works perfectly". When designing applications, one of the things that should always be done is to consider, "What if this doesn't work? What would be our fallback plan?" Because, if you don't think about that and plan for that, then one day something doesn't work perfectly and you find yourself in absolute disaster land because you have no other line of defense. That is actually the source of some really good stories (in here as well as in other places). I'll relate one:
Okay, so now let's create a fallback. That's hard, right? No, in this case actually it isn't: The solution is to back up the entire database before running the apply process. Every single time a batch is to be applied! That way, if something goes wrong, you fix the problem, restore, rerun and everything is cool. ...and usually, fallback is just like that. It mostly consists of one single element that I often see omitted: Keep the input state so that rerun is possible. There are "bazillions" of ways to do that; take your own pick. But some people like to live on the edge and depend on the application doing everything right, and when it doesn't, well, glad I'm not them. |
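A minimal sketch of the "keep the input state so that rerun is possible" idea, assuming the database is a single file that nothing else writes to while the batch runs. The function names, the ".bak" suffix, and the demo file are all hypothetical; a real database server would need its own backup tooling rather than a plain file copy.

```python
import shutil
import tempfile
from pathlib import Path

def apply_with_fallback(db_path, apply_batch):
    """Copy db_path aside, run apply_batch(db_path), and restore the copy on failure."""
    db_path = Path(db_path)
    backup_path = db_path.with_name(db_path.name + ".bak")
    shutil.copy2(db_path, backup_path)      # keep the input state
    try:
        apply_batch(db_path)
    except Exception:
        shutil.copy2(backup_path, db_path)  # restore so a rerun is possible
        raise
    return backup_path

# Demo: a batch that corrupts the file halfway through and then fails.
def failing_batch(path):
    path.write_text("half-applied garbage")
    raise RuntimeError("apply process failed")

with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "ledger.db"
    db.write_text("balance=100")
    try:
        apply_with_fallback(db, failing_batch)
    except RuntimeError:
        pass  # fix the problem, then rerun against the restored file
    restored = db.read_text()

print(restored)
```

After the failed batch, the file holds its original contents again, so the batch can be fixed and rerun.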
They might have done both. "Gahd dammit, Wolverine, that's your last screw-up! Go and get a job as a barber or something! Hmm ... reckon someone would buy this thing cheap? It's only scratched ..." |
|
Interesting read... You should really check ISTQB or ASTQB
http://www.istqb.org (or) http://www.astqb.org/ Particularly the foundation glossary. It has some of the concepts you mention here completely chewed out for you + tons more. (Especially the test levels and the risk calculation). Regards, Niels (ISTQB CTAL TM) |
|
I'd just like to point out: it is still theoretically possible to get hit by a bus even if you never leave your house.
99.999%... |
|
Or, rather recently, from around where I live:
BTW: This is not funny. Two people died in this accident. |