Today's submission comes from Florian, and it starts with "this app processes important business transaction by email", which is always a good way to start a WTF. I've seen a depressing number of applications in my life that use email as a means of data exchange. It's common enough that even the industry standard EDI protocol supports email as one of its transports, but hoo boy is that a terrible abuse of email systems.

Let's start in HtmlResponseParser.cs, which reads in HTML and "parses" it for certain structures so it can extract data.

   public static Response createFromAscii(string r)
        {
            string regEx = "^"
                + RegExpMgr.getRegularExpression("requestRefNum")
                + RegExpMgr.getRegularExpression("spaceTabSeparator")
                + RegExpMgr.getRegularExpression("bbTickerAndName")
              // snip: 10 more lines
                + ").*?";

            Match m = Regex.Match(r, regEx);

            if(m.Success)
            {
                // Let caller handle exceptions
                bool status = true;
                int id = parseID(m.Groups[1].Value);
                int type = parseType(m.Groups[1].Value);
                string ticker = m.Groups[2].Value;
                double quantity = parseQuantity(m.Groups[3].Value);
                DateTime date = parseDate
                    (
                    m.Groups[5].Value,
                    m.Groups[6].Value,
                    m.Groups[7].Value
                    );

Oh yeah, you know we're parsing HTML with regexes. What could go wrong with that? What's extra great about this version is that the regexes are clearly splitting on characters contained within some of the data- note the way they have to reconstruct the date by reading out of the groups. And also note how they don't take advantage of named groups at all- so we always have to access the group by the position, instead of some meaningful name.

But that's not the WTF. Why don't we take a look at the implementation of the RegExpMgr class.

   public class RegExpMgr
   {
      public const string RegExprDefsFile = "RegExpDefs.xml";

       public static string getRegularExpression(string name)
      {
         XmlDocument doc   = new XmlDocument();
         doc.Load(RegExprDefsFile);
         XmlNode root = doc.DocumentElement;   
         XmlNode node = root.SelectSingleNode("descendant::regularExpression[@name='" + name + "']");
         if (node != null)
            return node.InnerText.Trim();
         else
            return null;
      }
   }

The bad news is that they store all the regexes they want to use in an XML file. The good news is that they don't use regexes for parsing that XML file. The other bad news is that there's no caching of the contents of the file- every time you want a regex, we open the file, construct the DOM tree, and then use a query selector to find the one entry we want. Extra bonus points for returning null when a key can't be found, because why would we want to capture that as an exception?

And yet, that's still not the WTF, because Florian also gave us a big picture view of the business process.

First, a database stored procedure fetches a list of fields that will get exchanged with external parties and some text-based guidelines for how they need to validate. An application runs and reads that data, and uses it to construct an HTML table. It then connects to an instance of the Outlook application- yes really- to send that email out to the external parties. Once the email goes out, still through the Outlook application, the app monitors a specific folder in the inbox. When the third parties reply, an Outlook rule filters their email into that folder, which triggers the next step of the app. The app loads the email, parses the HTML table contents (which were edited by the external parties) and attempts to parse them (via regexes) back into data. This step fails 75% of the time, since it relies on the sender to get the table structured very specifically. It also can't really tell the difference between the header rows (because they're just regular table cells formatted in bold- there's no th tag here).

So it loads all of the data in the table, probably, and dumps it into a database table. The structure it dumps it in, however, is just as a set of space-separated substrings. So "this string of text" becomes the table (this,string,of,text). As you can imagine, this "temporary" table has a lot of columns. The app then reads this data back out of the database into an array of arrays of strings. Then for each string, it tries to determine if that string is a valid row, or not. And the way it does this is, well, it has a regex that tells it what a valid row should look like, as text. So what it does is it joins all of those cells back together and checks if it passes the regex.

Once that's done, it tries to save that data back in the original database table, thus allowing the third party the ability to "update" the data, all via email.

Florian closes:

That's it, I am offically going to the pub, it's been a long day

[Advertisement] BuildMaster allows you to create a self-service release management platform that allows different teams to manage their applications. Explore how!