Admin
Hmmm, the article seems to have an incorrect assertion about one of the test cases.
www.google.com
actually matches, thanks to the ? after the http branch. Another WTF with this expression is the inconsistent escaping of dots, e.g. the regex matches
wwwagoogle.com
.
Admin
Even
google.com
matches, I think. This article is TRWTF.
Admin
Not to forget,
http://sites.google.com
works, too, because the
www.
is OR-ed with alphanum.
Admin
https://www.regex101.com/ reports 3 unmatched
/
delimiters as well.
Admin
So the bug is the opposite of that asserted in the article? That's not a URL and shouldn't validate as one.
Admin
Depending on the context, it could be a valid URL (you could buttume it's http like most browsers do nowadays).
My main gripe is that the article starts off with a nice list of things to be validated and then somehow gets those test results wrong.
Admin
Wait, did you actually type "buttume", or is it just a wordfilter of assume?
Admin
Your conclusion is TRWTF. I'm hearing, "This is an incorrect regex. Therefore, regex should not be used for this purpose."
If that line of logic made sense, I could find examples that proved that every programming language in the world was useless. I suppose it would make more sense to validate the URL using 25 MB of nested if-else statements?
Admin
Also not validating:
Admin
It's a clbuttic in-joke, look it up in the brillant meme wiki.
Edit: Hanzo'd. (That's another meme I'm now too lazy to find the source of)
Admin
All the problems people point out are symptoms of the main regex problem: they're all but impossible to read.
Admin
Well, that and they're only meant to parse regular languages... and pretty much nothing is a pure regular language...
Admin
The regex is buggy / has a lot of redundant info in it. I don't think a regex for a URL would be too difficult and I'd question the need for upper/lower case matching as I think the ignore case option could be used. If the author did not support the news protocol then not including it in the regex is not a bug. I wrote an http client. If someone passed in ftp:// I'd return an error (actually I'd just crash since I don't care :wink:).
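For what it's worth, a minimal sketch of that last point. The scheme fragment below is hypothetical, not the article's actual pattern; it just shows Java's CASE_INSENSITIVE flag doing the upper/lower-case work so the regex doesn't have to:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveScheme {
    public static void main(String[] args) {
        // Hypothetical scheme fragment: match it once, case-insensitively,
        // instead of spelling out [Hh][Tt][Tt][Pp] by hand.
        Pattern scheme = Pattern.compile("^(https?|ftp)://", Pattern.CASE_INSENSITIVE);
        System.out.println(scheme.matcher("HTTP://example.com").find()); // true
        System.out.println(scheme.matcher("ftp://example.com").find());  // true
        System.out.println(scheme.matcher("news://example.com").find()); // false - news deliberately unsupported
    }
}
```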
Admin
(1) So that they work correctly, that is.
Other oddities:
www
followed by what it wants to be just a dot but is actually any character at the beginning, but allows as an alternative a sequence of alphanumerics followed by the very same dot that isn't just a dot. And of course, it overlooks the fact that
www
is a sequence of alphanumerics...
-
in them, but the first is not.
g
isn't a valid hex digit, nor is it two of any kind of digit. (I'm willing to be corrected on this point - I don't have the time to hunt down whether %g is really valid in a URL, but it doesn't feel right.)
Overall, I'd give it about half a point out of 20, because the original author of this monstrosity at least pretended to try.
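On the %g point: per RFC 3986, a percent-escape is "%" followed by exactly two hex digits, so a stricter fragment (illustrative, not taken from the article's regex) would look like this:

```java
import java.util.regex.Pattern;

public class PercentEncodingCheck {
    public static void main(String[] args) {
        // RFC 3986: pct-encoded = "%" HEXDIG HEXDIG
        Pattern pctEncoded = Pattern.compile("%[0-9A-Fa-f]{2}");
        System.out.println(pctEncoded.matcher("%2F").matches()); // true
        System.out.println(pctEncoded.matcher("%g1").matches()); // false - 'g' is not a hex digit
        System.out.println(pctEncoded.matcher("%gg").matches()); // false
    }
}
```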
Admin
It's because the article uses a Java regex, not the format regex101 expects. (It expects it in the form /<regex>/<flags>, so, obviously, the forward slash would be a premature end of the regex.)
https://regex101.com/r/sW4zZ6/2
The following are matched, where the article claimed they aren't:
http://sites.google.com www.google.com http://google.com
That doesn't mean the regex is good, though; here are some more things that it considers valid: http://www,google.com - with a comma, because the regex uses a dot (match anything) rather than an escaped dot (which is a dot); and http://w,google.com - because it can also start with any alphanumeric, and then it has a dot rather than an escaped dot.
Yes, it does not need to match, depending on what URLs are supposed to be valid. Pointing it out as a mistake is like claiming file://C:\My Documents\MyPicture.jpg is supposed to be a valid image to put in a message. In fact, Firefox doesn't even want to resolve that.
Yes, URL parameters. The regex does require the trailing slash, so http://www.test.com/?pageid=123&testid=1524 matches, but that doesn't mean URL parameters are not supported.
They don't need to be allowed. It would be up to the requirements whether they could be there or not.
So, out of the 9 problems stated, 3 are incorrectly labeled as problems (valid URLs that the regex recognises as valid), one is completely made up (news:// doesn't need to be valid), one is a problem but the explanation why is completely off (URL parameters not being allowed vs an end slash being mandatory), and one is entirely dependent on assumed requirements that are not specified (basic credentials for FTP links). Which means there are only 3 entirely valid problems found. There were at least a couple of very glaring, very obvious issues that weren't found, either (www,).
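A quick way to see the unescaped-dot point above, using simplified stand-in patterns rather than the article's actual regex:

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    public static void main(String[] args) {
        // Simplified stand-ins, not the original pattern.
        String unescaped = "http://www.google.com";     // '.' here means "any character"
        String escaped   = "http://www\\.google\\.com"; // '\.' means a literal dot

        System.out.println(Pattern.matches(unescaped, "http://www,google.com")); // true  - the comma slips through
        System.out.println(Pattern.matches(escaped,   "http://www,google.com")); // false
        System.out.println(Pattern.matches(escaped,   "http://www.google.com")); // true
    }
}
```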
Admin
Thanks to all the ? and * after most of the groups, this regex matches all kinds of things that aren't URLs. For starters, the string "aa.a.aa" matches this regexp and I'm pretty sure that's not a valid URL.
Admin
I'd like to see the proof that a URL is not a regular language. It's simple: use the pumping lemma. You've got one hour.
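For anyone taking up the challenge, here is the standard statement of the pumping lemma for regular languages (quoted for reference, not from the article):

```latex
% Pumping lemma for regular languages:
% if L is regular, there exists a pumping length p >= 1 such that
\forall w \in L,\ |w| \ge p \implies
  \exists\, x, y, z:\ w = xyz,\ |xy| \le p,\ |y| \ge 1,\
  \text{and } xy^{i}z \in L \text{ for all } i \ge 0.
```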
Admin
.aa doesn't exist, yes. However, xy.i.de would be valid.
Admin
No, you use a parser for an LL or LR grammar... really, folks, parser combinators have been a thing for how long now?
Admin
In fairness, Discourse thinks it’s a valid URL, too.
You probably don’t want to check that with a regex, though...
Admin
Admin
Isn't that what you usually want though?
There's really no way to say what kind of validation is best if we don't know what the URLs are used for. If you need to retrieve something from a URL, it's usually best to just pass it to the library that does that and let it handle it.
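For example, a rough sketch leaning on java.net.URI instead of a hand-rolled regex (the class and method names here are just illustrative, not anything from the article):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSanityCheck {
    // Let java.net.URI do the syntax work, then ask for whatever the caller actually needs.
    static boolean looksLikeHttpUrl(String s) {
        try {
            URI uri = new URI(s);
            String scheme = uri.getScheme();
            // URI only validates generic syntax, so also require an http(s) scheme and a host.
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHttpUrl("http://www.test.com/?pageid=123&testid=1524"));
        System.out.println(looksLikeHttpUrl("not a url"));
    }
}
```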
Admin
I've had a go.
Admin
I tried to access http://www.⌘.ws/
[image]
Admin
That's reasonably readable (apart from the escaping you need). I'm still curious if anyone can prove URLs are not a regular language. I've never seen a construction like a^n b^n or x w x.
Admin
AFAIK URIs are a regular language. This guy says so: http://cmsmcq.com/mib/?p=306 BNF
Admin
http://regexlib.com/REDetails.aspx?regexp_id=2814
T'interweb™. Never Short Of Idiots™.
Admin
Why isn't it a valid URL?
Admin
Quite similar to mine except this bit is wrong:
and I'm not sure why you'd type this:
instead of this:
Did we decide that 0 is invalid?
Admin
NXDOMAIN
? :laughing:
Admin
One matches 0999, the other doesn't.
@discoursebot - formatting in the quote!
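The quoted patterns didn't survive the quote formatting, but here's a hypothetical pair that shows the same distinction:

```java
import java.util.regex.Pattern;

public class LeadingZeroPort {
    public static void main(String[] args) {
        // Two plausible ways to write a port fragment; only the second rejects leading zeros.
        System.out.println(Pattern.matches("[0-9]{1,5}", "0999"));      // true
        System.out.println(Pattern.matches("[1-9][0-9]{0,4}", "0999")); // false
    }
}
```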
Admin
Am I not allowed to express my port as 00080?
Admin
Why are we still discussing this? Stop discussing this @PJH.
Admin
Validating a URL should use the same philosophy as validating an email: check if it exists.
Fire off a request, and if it's not 404* you're good to go.
*probably other codes, too. I dunno, do I look like a web guy?
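A rough sketch of that approach (purely illustrative; treating any 2xx/3xx as "exists" is my guess at "probably other codes, too"):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlExistsCheck {
    // The "just ask the server" philosophy: a URL is valid if something answers for it.
    static boolean exists(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");   // don't bother downloading the body
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false;                    // bad syntax, DNS failure, timeout, ...
        }
    }

    public static void main(String[] args) {
        System.out.println(exists("http://www.google.com/"));
    }
}
```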
Admin
Not if what subsequently parses it presumes the leading 0 means something, and decides that that particular thing isn't actually a number it recognises.
Admin
I'm not sure, it's kind of a tossup, but maybe that code is more reliable than this, which I once saw somewhere (redone from memory):
Admin
There's no + after the alphanumeric character class. I'm not the most fluent regex-mangler in the world, and I think this is Java, which I've not been near for a while, so correct me if I'm wrong, but I believe it's precisely one alphanumeric character followed by (any character whatsoever, but they probably meant a dot).
Of course, since the . wasn't escaped, ww would match, and the remaining w. would match the next part, so the special case is still redundant.
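If I'm reading it the same way, here's a tiny check of that interpretation, with a stand-in for just that branch rather than the full regex:

```java
import java.util.regex.Pattern;

public class WwwBranchDemo {
    public static void main(String[] args) {
        // Hypothetical reading of the branch: one alphanumeric, then any character at all.
        Pattern branch = Pattern.compile("[a-zA-Z0-9].");
        System.out.println(branch.matcher("ww").matches()); // true - so the www. special case adds nothing
        System.out.println(branch.matcher("w,").matches()); // true - any character, not just a dot
    }
}
```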
Admin
Buzz
I think you're right.
Admin
This regex still allows port 0.
Admin
Well, at least nobody tried to parse HTML with a regular expression.
Admin
It's a bit disappointing to see a highly opinionated and badly informed article on this awesome site. Yeah, all of us have probably seen interesting URL and email validation, and maybe even written one or two in our lives; it's just poorly written code, and as a joke, pretty much anticlimactic. Regexes are irreplaceable in some cases, and far from requiring cryptic spaghetti to solve the problem.
Admin
That's silly. Maybe I'm using a URL validator to make sure that the new URL someone entered for a new deployment is a valid URL. There is a difference between an invalid URL and a valid URL that just doesn't currently resolve to anything.
Admin
:frystare:
@antiquarian, I think we have another one for you.
Admin
I've seen better.
Admin
This feels like a challenge. What can I get past the editors and do to get all the front page commenters worked up? :laughing: :trollface:
Admin
Well, that's it then. It must be valid; we all know how good Discourse is at parsing and validating stuff.</sarcasm>
Admin
Redefining front-page trolling?
Admin
Hah, I was not expecting such backlash :) Was it that rude? I genuinely apologize for the comment then, no excuses.
Admin
They can't all be winners.
Sincerely, Jane Bailey
Admin
I wasn't the author of this piece so I'm not offended, but I do like to troll. For the record, I enjoyed the backlash against the Hanzo articles.