Searching for a Perl

When I was young and dumb, my freshman year college CS program taught us Scheme. Now, thinking myself a rather accomplished C++ programmer by that point (I was not), I thought this was a bit of an insult. But I was still interested in learning new languages, so I chose to dabble in Perl.

And I remember having the audacity to suggest to my professors that Scheme was a terrible introduction to programming, and instead we should start the students with an easy and accessible language, like Perl.

As I said, I was young and dumb. Today's anonymous submitter was handed some Perl code from their senior developer. Let's take a look at what a real Perl master can do:

        $pageA = httpRequest("$adress ");
        $pageA =~ s/<.+?>|&.+?;//g;
        @tempA = split(/\cj/,$pageA);

        for($i=0; $i<@tempA; $i++) {
           $tempA[$i] =~ m/$pattern/g; 
           if ($tempA[$i] =~ /$pattern/)
           {           
                $found = 1;
           }
        }      
        return $found;

We start by sending an HTTP request to fetch a page from $adress [sic]. We then… wait, those angle brackets in a regex. Oh no, are we parsing HTML via regular expressions? Well, no, not really. There won't be any Zalgo here.

The regex matches an angle bracket, followed by one or more other characters, non-greedily, then a closing angle bracket. Or anything that has an ampersand followed by one or more characters, again non-greedily, then a ;. That is to say, it's attempting to strip HTML tags and HTML entities out of the text. Of course, if you use a < for anything other than an HTML tag, or an & anywhere in the text, this will definitely break in interesting ways.
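To make "interesting ways" concrete, here is a small sketch that runs the same substitution over a few sample strings. The helper name strip_markup and the sample inputs are made up for illustration; only the regex comes from the submission.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the exact substitution from the submitted code.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/<.+?>|&.+?;//g;
    return $text;
}

# Real markup: the tags and the entity disappear, as intended.
print strip_markup('<p>Hello &amp; welcome</p>'), "\n";    # prints: Hello  welcome

# A comparison in plain text: everything between < and > is eaten.
print strip_markup('if a < b and b > c then stop'), "\n";  # prints: if a  c then stop

# A bare ampersand followed later by a semicolon: the text in between is eaten.
print strip_markup('Tom & Jerry; the classic'), "\n";      # prints: Tom  the classic
```

Since . does not match a newline without the /s modifier, a tag that spans two lines would also slip through this substitution untouched.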

Then we split on \cj. The \c starts a control-character escape sequence, so this is really ^J (Ctrl-J), which is ASCII 10, the line feed. So it's breaking on newlines, in just the most "write only language" way possible.
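A quick way to convince yourself that \cj is just a line feed in disguise is to compare it with \n directly. This is a minimal sketch; the sample page text is invented, and it assumes an ASCII-based platform, where "\n" is character 10.

```perl
use strict;
use warnings;

# Both escapes name the same character: Control-J is ASCII 10, the line feed.
printf "%s is %d\n", '\cj', ord("\cj");
printf "%s is %d\n", '\n',  ord("\n");

# Splitting on either one gives the same list of lines.
my $page  = "first line\nsecond line\nthird line";
my @by_cj = split /\cj/, $page;
my @by_nl = split /\n/,  $page;

print scalar(@by_cj), " lines either way\n" if "@by_cj" eq "@by_nl";
```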

With that done, we can now walk through the array of lines. We start by doing a match on the current line, searching for a pattern. Then we… do the match again, but this time with an if statement. If there is a match, we set $found = 1 and then keep searching.

What's "great" about this is that it only checks if a webpage contains a match and returns that. It doesn't return the match. It downloads a whole page, strips anything that vaguely might be an HTML tag or character entity, and then checks for a match on every line, only to return whether or not there was one.
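For comparison, here is a rough sketch of the same yes/no check with the redundant match removed and an early return on the first hit. The function name page_matches and the sample page are hypothetical, and the fetch is left out because httpRequest isn't shown in the submission. It also returns an explicit 0 when nothing matches, whereas the excerpt never initializes $found.

```perl
use strict;
use warnings;

# Hypothetical replacement for the loop: takes already-fetched page text and
# a pattern, and returns 1 if any line matches after stripping, else 0.
sub page_matches {
    my ($page, $pattern) = @_;
    $page =~ s/<.+?>|&.+?;//g;            # same naive stripping as the original
    for my $line (split /\n/, $page) {
        return 1 if $line =~ /$pattern/;  # stop at the first hit
    }
    return 0;
}

# Invented sample input standing in for a downloaded page.
my $page = "<h1>Status</h1>\n<p>All systems &nbsp; nominal</p>\n";
print page_matches($page, qr/nominal/) ? "found\n" : "not found\n";  # prints: found
print page_matches($page, qr/offline/) ? "found\n" : "not found\n";  # prints: not found
```

This keeps the line-by-line behavior, so patterns anchored with ^ or $ still apply per line, just like in the original.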

But again, the important thing: at least they're not trying to parse it.

Featured Comments

Sauron (unregistered) 2023-06-22

That code is insane.

An arbitrary match somewhere in the middle of HTML data doesn't mean a thing.

First, if there is no check that the HTML is well formed, then the result probably doesn't mean a thing (at best it is unreliable).

Also, HTML is a language to structure documents. An arbitrary match doesn't tell you whether you're matching some text content, some CSS, some JS, some SVG, or (God forbid!) the base64 code someone stupid put in the src attribute of an <img>, some HTML comment, or whatever else that is technically valid (or not!) in today's sprawling web standards.