Marcus A. worked for a man who believed that regular expressions were the be-all and end-all, capable of trivially solving any data problem that might arise. Their code base was riddled with regular-expression transformations that would reduce most developers to tears. This manager also believed that anything that could be explained could be implemented more cheaply offshore.

Their main application contained a raw text field that held comments from customers. Someone got the idea that these comments could be mined and used for business purposes. To do that with a free-form text field, however, the non-standardized words and phrases humans use would have to be cajoled into something more easily processed. To this end, Marcus was tasked with managing an outsourced effort to standardize the data. The input would be a Customer Data table with a comments column. The output would be the same column, but with abbreviations, acronyms, and the like converted to standardized text.

To prepare for outsourcing, an aide manually combed through the myriad rows in the database, collecting acronyms and abbreviations into an Excel spreadsheet. There were more than 1,000 items, covering all the usual ways folks might abbreviate things:

        Old Content      New Content
        ---------------- --------------------------------
        "IRS"            Internal Revenue Service
        " IRS "          Internal Revenue Service
        "I.R.S."         Internal Revenue Service
        "irs"            Internal Revenue Service
        "i.r.s."         Internal Revenue Service
        "I R S "         Internal Revenue Service
        "FBI"            Federal Bureau of Investigation
        " FBI "          Federal Bureau of Investigation
        ...
        "DR"             Doctor
        "dr"             Doctor
        "dr."            Doctor
        " dr "           Doctor
        ...
        "off"            Office
        ...
        "fnf"            FILE_NOT_FOUND

A work scope document was prepared, contracts were bid, and the work was shipped offshore to a coder named Rahul.

Now, Rahul could have realized that this could be generalized and implemented a relatively short solution using some lists of strings and a couple of loops. Or he could have blindly implemented exactly what was requested. The code that was turned in consisted of hundreds of brute-forced procedures that all looked like this:

  String sourceRegularExpression = "IRS";  // OP: hard coded in every function for each of the values
  String targetRegularExpression = "Internal Revenue Service";
  String query = "select id,comments from CustomerData where comments like '%" + sourceRegularExpression + "%'";
  Connection con = ...
  PreparedStatement ps = con.prepareStatement(query);
  ResultSet rs = ps.executeQuery();
  while (rs.next()) {
    Long id = rs.getLong(1);
    String comments = rs.getString(2);
    comments = comments.replaceAll(sourceRegularExpression, targetRegularExpression);
    PreparedStatement ps2= con.prepareStatement("update CustomerData set comments = ? where id = ?");
    ps2.setString(1,comments);
    ps2.setLong(2,id);
    ps2.executeUpdate();
    ps2.close();
  }
  rs.close();
  ps.close();
  con.close();

Of course, all of these functions were called from a single large wrapper function. Thus the database was queried once for every string in the list, more than a thousand times in all, and every row retrieved was updated on its way back in. If the same row matched several strings, it was fetched and rewritten that many times. Not to mention that the grinding would only get worse as the number of customer comments grew over time.
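For contrast, here is a minimal sketch of the generalized approach mentioned above: load the Old Content/New Content pairs once, scan the table once, apply every replacement to each row in memory, and write each row back at most once. This is hypothetical and not anything from the submission; the CustomerData table and comments column come from the story, but the CommentStandardizer class, the loadReplacements helper, the JDBC URL taken from args[0], and the use of Pattern.quote with word boundaries are all assumptions made for illustration.

  import java.sql.*;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.regex.Pattern;

  public class CommentStandardizer {

    // Hypothetical loader: in practice the ~1,000 pairs would be read from the
    // spreadsheet; a few are hard-coded here just to show the shape.
    static Map<Pattern, String> loadReplacements() {
      Map<String, String> raw = new LinkedHashMap<>();
      raw.put("I.R.S.", "Internal Revenue Service");
      raw.put("IRS", "Internal Revenue Service");
      raw.put("FBI", "Federal Bureau of Investigation");
      raw.put("dr.", "Doctor");
      raw.put("off", "Office");

      Map<Pattern, String> compiled = new LinkedHashMap<>();
      for (Map.Entry<String, String> e : raw.entrySet()) {
        String key = e.getKey();
        // Quote the key so "." matches literally, and anchor on word boundaries
        // (where the key starts or ends with a word character) so that "off"
        // does not fire inside "offense".
        String prefix = Character.isLetterOrDigit(key.charAt(0)) ? "\\b" : "";
        String suffix = Character.isLetterOrDigit(key.charAt(key.length() - 1)) ? "\\b" : "";
        compiled.put(Pattern.compile("(?i)" + prefix + Pattern.quote(key) + suffix), e.getValue());
      }
      return compiled;
    }

    public static void main(String[] args) throws SQLException {
      Map<Pattern, String> replacements = loadReplacements();

      try (Connection con = DriverManager.getConnection(args[0]);
           PreparedStatement select = con.prepareStatement(
               "select id, comments from CustomerData");
           PreparedStatement update = con.prepareStatement(
               "update CustomerData set comments = ? where id = ?");
           ResultSet rs = select.executeQuery()) {

        while (rs.next()) {
          long id = rs.getLong(1);
          String original = rs.getString(2);
          String comments = original;

          // Apply every replacement to the row in memory.
          for (Map.Entry<Pattern, String> e : replacements.entrySet()) {
            comments = e.getKey().matcher(comments).replaceAll(e.getValue());
          }

          // One update per row, and only when something actually changed.
          if (!comments.equals(original)) {
            update.setString(1, comments);
            update.setLong(2, id);
            update.executeUpdate();
          }
        }
      }
    }
  }

One pass over the table, at most one update per row, and a thousand fewer queries than the version that shipped.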

Most importantly, Marcus can only hope that nobody ever commented about their first drinking and driving offense.
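And that fear is well founded: blind substring replacement really does mangle that word, whereas anchoring the pattern on word boundaries would leave it alone. A tiny, purely hypothetical demonstration:

  public class OffenseDemo {
    public static void main(String[] args) {
      String comment = "Got my first drinking and driving offense";

      // Blind substring replacement, as in the submitted code:
      System.out.println(comment.replaceAll("off", "Office"));
      // -> "Got my first drinking and driving Officeense"

      // Anchored on word boundaries, only whole words match:
      System.out.println(comment.replaceAll("\\boff\\b", "Office"));
      // -> unchanged; "offense" is left alone
    }
  }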
