As most development managers know, the FBI's Virtual Case File (VCF) system has become the software industry's epitome of an expensive failed project. Costing taxpayers between $100 million and $200 million over four years, the VCF delivered little more than a mountain of useless documentation, nearly a million lines of code that will never run in production, and a whole lot of costly lessons. Worse still, the lessons offered by this multi-million dollar failure could just as easily have been found in a $50 software engineering textbook. In fact, the major factors behind VCF's failure read much like such a book's table of contents:

  • Enterprise Architecture: VCF had none.
  • Management: Developers were both poorly managed and micromanaged.
  • Skilled Personnel: Managers and engineers with no formal training were placed in critical roles.
  • Requirements: They were constantly being changed.
  • Scope Creep: New features were added even after the project was behind schedule.
  • Steady Team: More people were constantly added to the team in an attempt to speed the project.

While these are all valuable lessons that every development manager should take to heart, one of the most important -- and certainly least discussed -- lessons stems from one of the rare correct decisions made on the project: the decision to cut bait and scrap the whole thing.

As painfully difficult as it must have been to halt the VCF, the alternative would have been far worse. If the VCF system had been put into production, the bureau's productivity would have ground to a halt as agents struggled to learn and use the bug-infested software. Programmers would have had to work day and night, hacking together patch after patch in a futile attempt to fix known defects, all while introducing countless new defects into the system. Untold millions would have been spent bringing the system to that dangerous state of "stable instability," where the costs and risks associated with fixing even the most trivial defect are far too great to justify. Eventually, and probably within a few short years, the system would have had to be rebuilt from scratch.

Although only a handful of development managers will ever work on a project the size of VCF, most have experienced first-hand what the VCF would have become: a staggering behemoth application of rapidly decaying quality with a rotting, ever-growing codebase. This class of applications represents the worst type of software failure: the perceived success.

Worse than Failure

When an enterprise software application is deployed to production, it's almost universally considered to be a success. In most other industries, equating completion and success would be ludicrous. After all, few in the construction industry would deem a two-year-old house with a leaky roof and a shifting foundation anything but a disaster -- and an outright liability.

Yet, had the VCF system gone live, almost everyone involved -- from the business analysts to the programmers to the project managers -- would have called their work a success. Even after the code devolved into an unsustainable mess, many would still refuse to admit that they had failed. The result is a cycle of failure, as dev shops refuse to recognize and correct issues that plague their projects.

As editor of this site, I've been tracking software development meltdowns for years. Most of these stories share a lot of common elements: poor management, misguided process and loads of human error. In fact, the vast majority of the development failures I've documented have little, if anything, to do with tools and technologies, and everything to do with the people using them.

The only way to prevent this kind of failure is to be pragmatic about success. Realistically, enterprise software applications should last at least 15 years; software that falls short of that mark falls equally short of being a success.

The primary reason custom enterprise software is replaced is that it simply becomes too costly to maintain. This holds true even for old, highly successful software. Imagine the difficulty and cost of building (or even finding the programmers to build) a real-time inventory lookup module on a 1980s-era COBOL-based warehousing system. In situations like this, gutting the old system -- or at least replacing the bulk of it -- is usually the only sensible choice that allows for affordable maintenance.

A well-designed system built over the past eight to 10 years, however, shouldn't require a forklift upgrade.

Makings of a Mess

Maintainability is often at the heart of failed software projects, says (former) Microsoft MVP Phil Haack, a developer who frequently blogs about software maintainability. He learned the maintainability lesson early in his career, when his team was tasked with building a fairly simple marketing Web site that a client would use to receive feedback. The site included a form that visitors would fill out, with the input sent to the company as an e-mail.

The team decided to keep it simple: no database, no special configuration and no extensibility. And the site worked just fine. The project was delivered on time and on budget, and the client was satisfied. In fact, it worked so well that the client came back and asked for an additional form to be added to the software.

Though the software was never designed to handle multiple forms, a second form was not difficult to add. They were able to hack in the additional requirements and make it work exactly as the client requested. A month later, the client asked for a third form to be added. Haack's team hacked, compiled and delivered it. Then they asked for a fourth. Hack, compile, deliver. And a fifth. Hack, compile, deliver. With each change request, the once-simple application evolved into a tangled mess.

After a year's worth of changes, the Web application utilized a database, COM objects and all sorts of other technologies that were well beyond the scope of the original application. What's more, each change to the code became riskier and costlier as time progressed. The team was rapidly losing the ability to maintain the application.

"Each project took progressively longer, even though the requirements were as simple as all the previous ones," Haack says. "The application became so monolithic and so difficult to maintain."

Eventually, Haack's team had to break the news to their client: The application they had been building and extending for almost two years had become a complete mess. Worse still, it had grown so complicated that rewriting the app was no longer even feasible. They'd have to build a "legacy bridge" in a new application and then build a standard platform on top of that.

Admitting to Failure

It wasn't easy for Haack's team to admit to failure. Sure, they could have blamed their situation on the client's ever-increasing demands, but in the end it was their team that built the application and allowed it to degrade from a simple form to a maintenance nightmare. Their mistake wasn't in the original, simple design, but in not abandoning that design sooner.

In other words, the team's decision to build a simple application for a simple requirement was the right one. No one could have predicted that the software would change in the manner it did. Had they built the initial version of the application to be more extensible, perhaps including a forms database or form groupings, the requirements could have easily shifted to make those features irrelevant or, worse, cumbersome.

"Avoid premature generalization," Haack advises. "Don't build the system to predict every change. Make it resilient to change."

As for knowing when to generalize, Haack lives by the rule of three: "The first time you notice something that might repeat, don't generalize it. The second time the situation occurs, develop in a similar fashion -- possibly even copy/paste -- but don't generalize yet. On the third time, look to generalize the approach."
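The rule of three can be sketched in code. The following is a hypothetical Python illustration -- the form names and the `deliver()` helper are stand-ins invented for this example, not code from Haack's actual project:

```python
# Hypothetical sketch of the "rule of three" for generalization.

outbox = []  # stand-in for a real mail transport; records sent messages

def deliver(recipient, body):
    outbox.append((recipient, body))

# First occurrence: write the specific, straightforward code.
def send_contact_form(name, email, message):
    body = f"Name: {name}\nEmail: {email}\nMessage: {message}"
    deliver("contact@example.com", body)

# Second occurrence: similar code appears, but resist generalizing --
# copy, adapt, and move on.
def send_survey_form(name, email, rating):
    body = f"Name: {name}\nEmail: {email}\nRating: {rating}"
    deliver("survey@example.com", body)

# Third occurrence: the pattern has proven itself, so generalize.
# Any form is now just a recipient plus a set of labeled fields.
def send_form(recipient, fields):
    body = "\n".join(f"{label}: {value}" for label, value in fields.items())
    deliver(recipient, body)
```

The point is not that the first two functions are wrong; it's that generalizing on the first occurrence would have been a guess, while by the third the shape of the abstraction is demonstrated by real requirements.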

In the case of Haack's Web application, this approach would have motivated the team to begin generalizing the working code with each new iteration. Over time, the changes might have prevented "ancient" parts of the code from becoming an unmanageable burden.

A Change-Friendly Environment

Both the VCF system and Haack's Web application collapsed under their own weight, the former before it could even make it to production and the latter after a brief two-year life. Both would have lasted far longer had they been maintainable.

Creating software that's maintainable -- meaning resilient to change -- must start early on, and at a much higher level than the code or even the application's design. Maintainable software begins at the highest level -- the Enterprise Architecture -- and works its way down through each phase of the software development lifecycle.

Business analysts and software developers must be trained to understand that the software they build can and will change in ways they'll never be able to predict. Therefore, tools such as refactoring and design patterns should be an integral part of the development process.
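One of the simplest refactorings illustrates the idea. In this hypothetical Python sketch (the function names and validation rule are invented for illustration), a duplicated rule is extracted so that an unpredictable future change lands in exactly one place:

```python
# Before: the validation rule is duplicated inline, so a change to it
# must be hunted down in every copy.
def register_user_unrefactored(email):
    if "@" not in email or email.startswith("@"):
        raise ValueError("invalid email")
    return {"email": email, "active": True}

# After: the rule is extracted into one function. When requirements
# change (say, stricter validation), only validate_email changes.
def validate_email(email):
    if "@" not in email or email.startswith("@"):
        raise ValueError("invalid email")

def register_user(email):
    validate_email(email)
    return {"email": email, "active": True}

def subscribe_to_newsletter(email):
    validate_email(email)
    return {"email": email, "subscribed": True}
```

The behavior is identical before and after; what changes is where a future modification has to happen.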

Testing and quality assurance also need to be resilient to change. When the software changes, so should the unit tests, smoke tests and regression tests. Without adequate testing early and throughout, defects from the constant changes will pile on and become even more difficult to resolve.
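As a minimal sketch of tests evolving with the code (a hypothetical Python example using the standard unittest module; the pricing function is invented for illustration):

```python
import unittest

def total_price(items, tax_rate=0.0):
    # tax_rate was added in a later change request; the test suite
    # below had to grow alongside it or the change went unverified.
    subtotal = sum(items)
    return round(subtotal * (1 + tax_rate), 2)

class TotalPriceTests(unittest.TestCase):
    def test_no_tax(self):
        # Original regression test, written with the first version.
        self.assertEqual(total_price([1.00, 2.50]), 3.50)

    def test_with_tax(self):
        # Added when tax support changed the function's behavior.
        self.assertEqual(total_price([10.00], tax_rate=0.08), 10.80)
```

A suite frozen at the first test would still pass after the tax change -- and would silently stop covering the part of the code most likely to harbor new defects.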

Ultimately, development managers must be able to recognize the point at which a software effort has failed and to make the decision to move on. The sooner that happens, the less damage the failed project will inflict. And, unlike the VCF, there might even be something left to salvage from the effort.

Development Disasters was originally published in the August 01, 2007 issue of Redmond Developer News. RDN is a free magazine for influential readers that provides insight into Microsoft's plans, and news on the latest happenings and products in the Windows development marketplace.