Parsing HTML is no walk in the park. With the possibility of unclosed tags and mismatched quotation marks on any given page, it’s a veritable minefield of horrible hypertext. However, there are dozens of reliable libraries that a developer could use to do the heavy lifting.

But the heads of the project that Pedro worked on had chosen the worst library they could find.

The project was a census of web standards usage in the wild. Pedro was assigned to see why the HTML parsing library was choking on pages of 1MB or more, sometimes taking 24 hours to process a single page. The culprit, Pedro found, was the tag validation function:

public Results analyze(File page) throws Exception{
    Results output = new Results(page); 
    BufferedStringReader bw = new BufferedStringReader(new FileReader(page)); 
    String pageHTML = ""; 
        pageHTML += bw.readLine() + "\n"; 
    int index = 0; 
    for(index = pageHTML.indexOf("<"); index != -1; index = pageHTML.indexOf("<",index+1)){ 
        if(pageHTML.toLowerCase().regionMatches(index, "<a", 0, 2)){
            //Do a test involving the <a> tag
        }
        if(pageHTML.toLowerCase().regionMatches(index, "<br", 0, 3)){
            //Do a test involving the <br> tag
        }
        if(pageHTML.toLowerCase().regionMatches(index, "<img", 0, 4)){
            //Do a test involving the <img> tag
        }
        //snip about 70 other HTML tags, including a bunch where absolutely no tests were run.
    }
    return output;
}

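Not only did the validator test every "<" in the page against every known tag name, it also called toLowerCase() on the entire document for each of those tests. Each call walks the whole page again (and copies it whenever anything needs converting), so a page of 1MB or more gets reprocessed dozens of times for every single "<" it contains.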

The library developer must have really wanted those tags to be lowercase.
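
For contrast, here is a minimal sketch of how the same scan could avoid the repeated lowercasing. It is not the library's code: the class name TagScan, the method countTags, and the three-entry TAGS list are hypothetical stand-ins, and counting matches takes the place of the unknown per-tag tests. It relies on the five-argument String.regionMatches, whose ignoreCase flag compares characters in place. Like the original, it only checks prefixes, so "<a" would also match "<abbr".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagScan {
    // Hypothetical subset of the tag names the validator knows about.
    static final String[] TAGS = {"a", "br", "img"};

    /** Counts opening tags by name without lowercasing or copying the page. */
    public static Map<String, Integer> countTags(String pageHTML) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int index = pageHTML.indexOf('<'); index != -1;
                 index = pageHTML.indexOf('<', index + 1)) {
            for (String tag : TAGS) {
                // ignoreCase = true compares only tag.length() characters,
                // starting just after the '<', with no lowercase copy made.
                if (pageHTML.regionMatches(true, index + 1, tag, 0, tag.length())) {
                    counts.merge(tag, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, br=1, img=1}
        System.out.println(countTags("<A href=x><br><IMG src=y><a>"));
    }
}
```

Each comparison now touches only a handful of characters, so the work per "<" no longer depends on the size of the page.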
