• (disco)

    Hmmm, the article seems to make an incorrect assertion about one of its test cases.

    www.google.com actually matches, thanks to the ? after the http branch.

    Another WTF with this expression is the inconsistent escaping of dots, e.g. the regex matches wwwagoogle.com.

  • (disco) in reply to JBert
    JBert:
    Another WTF with this expression is the inconsistent escaping of dots, e.g. the regex matches wwwagoogle.com. (Hurray for white-box testing)

    Even google.com matches, I think. This article is TRWTF.

  • (disco) in reply to JBert

    Not to forget, http://sites.google.com works, too, because the www. is OR-ed with alphanum.

  • (disco) in reply to JBert

    https://www.regex101.com/ reports 3 unmatched / delimiters as well.

  • (disco) in reply to JBert
    JBert:
    www.google.com actually matches

    So the bug is the opposite of that asserted in the article? That's not a URL and shouldn't validate as one.

  • (disco) in reply to another_sam

    Depending on the context, it could be a valid URL (you could buttume it's http like most browsers do nowadays).

    My main gripe is that the article starts off with a nice list of things to be validated and then somehow gets those test results wrong.

  • (disco) in reply to JBert

    Wait, did you actually type "buttume", or is it just a wordfilter of assume?

  • (disco)

    Your conclusion is TRWTF. I'm hearing, "This is an incorrect regex. Therefore, regex should not be used for this purpose."

    If that line of logic made sense, I could find examples that proved that every programming language in the world was useless. I suppose it would make more sense to validate the URL using 25 MB of nested if-else statements?

  • (disco) in reply to Sizik
    Sizik:
    Wait, did you actually type "buttume", or is it just a wordfilter of assume?
    He did type "buttume". [It's a site in-joke](http://forums.thedailywtf.com/forums/t/5552.aspx).
    Yamikuronue:
    • http://www.test.com?pageid=123&testid=1524 (no url parameters)
    This one doesn't validate because of the lack of a trailing slash after the domain name. If you add it back in, the query string is considered A-OK.

    Also not validating:

    • http://www.glico.xn--zckzah (no IDN TLDs, although IDN subdomains are OK.)
  • (disco) in reply to Sizik
    Sizik:
    Wait, did you actually type "buttume", or is it just a wordfilter of assume?
    You must be new here.

    It's a clbuttic in-joke, look it up in the brillant meme wiki.

    Edit: Hanzo'd. (That's another meme I'm now too lazy to find the source of)

  • (disco)

    All the problems people point out are symptoms of the main Regex problem: they're all but impossible to read

  • (disco) in reply to Jaloopa

    Well, that, and they're only meant to parse regular languages... and pretty much nothing is a pure regular language...

  • (disco)

    The regex is buggy / has a lot of redundant info in it. I don't think a regex for a URL would be too difficult to write, and I'd question the need for explicit upper/lower case matching, since the ignore-case option could be used instead. If the author did not support the news protocol, then not including it in the regex is not a bug. I wrote an http client; if someone passed in ftp:// I'd return an error (actually I'd just crash, since I don't care :wink:).

  • (disco) in reply to Jaloopa
    Jaloopa:
    All the problems people point out are symptoms of the main Regex problem: they're all but impossible to read
    And, if this example is anything to go by, they aren't exactly easy to write(1), either.

    (1) So that they work correctly, that is.

    Other oddities:

    • it checks carefully for www followed by what it wants to be just a dot but is actually any character at the beginning, but allows as an alternative a sequence of alphanumerics followed by the very same dot that isn't just a dot. And of course, it overlooks the fact that www is a sequence of alphanumerics...
    • subsequent "words" ("domain-name" parts) are allowed to have - in them, but the first is not.
    • the TLD part is limited to six characters. A look at http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains suggests that while this copes reasonably well with the majority of domains, it isn't at all adequate. Things like .cancerresearch strike me as being (a) stupid and (b) rather longer than six characters.
    • IPv4 address URLs are, indeed, not matched, but because the "TLD" isn't pure alphabetic.
    • IPv6 address URLs are rejected because the "domain name" starts with a square bracket, which isn't allowed anywhere in a URL by this regex.
    • http://a..b...................com/ is accepted, but can't be resolved because DNS doesn't allow empty "labels" (parts between dots).
    • port numbers are accepted, from 0 to 99999. Most OSes don't allow the use of port zero as the destination of a connect() call, and port numbers are 16-bit unsigned ints, so the last valid one is 65535.
    • multiple port numbers are accepted (http://www.example.com:42:80:999/)
    • percent-sign URL encoding is mis-handled - http://www.example.com/%g is treated as valid, and the last time I looked, g isn't a valid hex digit, nor is it two of any kind of digit. (I'm willing to be corrected on this point - I don't have the time to hunt down whether %g is really valid in a URL, but it doesn't feel right.)

    Overall, I'd give it about half a point out of 20, because the original author of this monstrosity at least pretended to try.
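
    A hedged sketch of what corrected fragments for a few of these points might look like (Java is assumed, to match the article's regex; the names and pieces below are illustrative, not a complete URL validator):

    import java.util.regex.Pattern;

    class Fragments {
        // A DNS label: alphanumerics with inner hyphens only, so empty
        // labels (as in "a..b") and stray punctuation can't sneak through.
        static final Pattern LABEL =
                Pattern.compile("[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?");

        // A hostname: labels joined by *escaped* dots.
        static final Pattern HOST = Pattern.compile(
                "[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?"
                + "(\\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)+");

        // Percent-encoding: '%' must be followed by exactly two hex
        // digits, so /%g is rejected.
        static final Pattern PCT = Pattern.compile("%[0-9A-Fa-f]{2}");
    }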

  • (disco) in reply to PJH

    It's because the article uses a Java regex, not the format regex101 expects. (It expects the regex in the form /<regex>/<flags>, so, obviously, the first unescaped forward slash would be a premature end of the regex.)

    https://regex101.com/r/sW4zZ6/2

    The following are matched, where the article claimed they aren't:

    • http://sites.google.com
    • www.google.com
    • http://google.com

    That doesn't mean the regex is good, though; here are some more things that are valid:

    • http://www,google.com (with a comma, because the regex uses a dot, which matches anything, rather than an escaped dot, which matches only a dot)
    • http://w,google.com (because the host can also start with any single alphanumeric, followed by that same unescaped dot)
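
    A quick Java demonstration of the unescaped-dot point (DotDemo and the pattern strings are simplified stand-ins I'm assuming, not the article's actual regex):

    class DotDemo {
        public static void main(String[] args) {
            // '.' matches any character, so the commas sail through:
            System.out.println("www,google,com".matches("www.google.com"));     // true
            // "\\." matches only a literal dot:
            System.out.println("www,google,com".matches("www\\.google\\.com")); // false
        }
    }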

    news://www.example.com (only http, https, ftp, and ftps allowed)

    Yes, it does not need to match; it depends on what URLs are supposed to be valid. Pointing it out as a mistake is like claiming file://C:\My Documents\MyPicture.jpg is supposed to be a valid image to put in a message. In fact, Firefox doesn't even want to resolve that.

    http://www.test.com?pageid=123&testid=1524 (no url parameters)

    Yes, URL parameters do work. The regex does require the trailing slash, so http://www.test.com/?pageid=123&testid=1524 matches, but that doesn't mean URL parameters are unsupported; the mandatory slash is the actual bug.

    ftp://user:password@example.com (no basic auth credentials allowed)

    They don't need to be allowed. It would be up to the requirements whether they could be there or not.

    So, out of the 9 problems stated, 3 are incorrectly labeled as problems (valid URLs that the regex recognises as valid), one is completely made up (news:// doesn't need to be valid), one is a problem but the explanation why is completely off (URL parameters not being allowed vs an end slash being mandatory), and one is entirely dependent on assumed requirements that are not specified (basic credentials for FTP links). Which means there are only 3 entirely valid problems found. And at least a couple of very glaring, very obvious issues (the unescaped dots that let www,google.com through) weren't found at all.

  • (disco)

    Thanks to all the ? and * after most of the groups, this regex matches all kinds of things that aren't URLs. For starters, the string "aa.a.aa" matches this regexp and I'm pretty sure that's not a valid URL.

  • (disco)

    I'd like to see the proof that a URL is not a regular language. It's simple: use the pumping lemma. You've got one hour.

  • (disco) in reply to pedantic_git

    .aa doesn't exist, yes. However, xy.i.de would be valid.

  • (disco) in reply to Robert_Morson
    Robert_Morson:
    I suppose it would make more sense to validate the URL using 25 MB of nested if-else statements?
    :headdesk:

    No, you use a parser for an LL or LR grammar... really folks, parser combinators have been a thing for how long now?
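
    A minimal sketch of the parser-style alternative, with java.net.URI standing in for a hand-rolled grammar or combinator library (an assumption on my part; isValidHttpUrl and the scheme check are illustrative, not anyone's actual code):

    import java.net.URI;
    import java.net.URISyntaxException;

    class UrlCheck {
        // Let a real URI parser do the work instead of a regex
        // or nested if-else statements.
        static boolean isValidHttpUrl(String s) {
            try {
                URI uri = new URI(s);
                // new URI() alone also accepts relative references,
                // so require an explicit scheme and host.
                return uri.getScheme() != null
                    && uri.getHost() != null
                    && (uri.getScheme().equalsIgnoreCase("http")
                        || uri.getScheme().equalsIgnoreCase("https"));
            } catch (URISyntaxException e) {
                return false;
            }
        }
    }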

  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    http://a..b...................com/ is accepted, but can't be resolved because DNS doesn't allow empty "labels" (parts between dots).

    In fairness, Discourse thinks it’s a valid URL, too.

    Steve_The_Cynic:
    port numbers are accepted, from 0 to 99999. Most OSes don't allow the use of port zero as the destination of a connect() call, and port numbers are 16-bit unsigned ints, so the last valid one is 65535
    You probably don’t want to check that with a regex, though...
  • (disco) in reply to VinDuv
    VinDuv:
    In fairness, Discourse thinks it’s a valid URL, too.
    Wait, what? Are you being fair toward Discourse? Or do you mean that there are other things that are just as crap as Discourse?
    VinDuv:
    You probably don’t want to check that with a regex, though...
    Well, no, I wouldn't want to do that.
  • (disco)

    news://www.example.com (only http, https, ftp, and ftps allowed)

    Isn't that what you usually want, though?

    There's really no way to say what kind of validation is best if we don't know what the URLs are used for. If you need to retrieve something from it, it's usually best to just pass them to the library that does that and let it handle it.

  • (disco) in reply to VinDuv
    VinDuv:
    You probably don’t want to check that with a regex, though...

    I've had a go.

    /^([0-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d|6553[0-5]|\d{0,4})$/
    
  • (disco)

    I tried to access http://www.⌘.ws/

    [image]
  • (disco) in reply to Keith
    Keith:
    /^([0-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d|6553[0-5]|\d{0,4})$/
    Ugly, but it works. And then you do something like
    var validPortRegex = "whateveryoutypedthere";
    var validPortRegexOption = "(:" + validPortRegex + ")?";
    ...
    var urlRegex = protocol + "://" + domain + validPortRegexOption + ...;
    

    That's reasonably readable (apart from the escaping you need). I'm still curious if anyone can prove URLs are not a regular language. I've never seen a construction like a^n b^n or x w x.
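
    For reference, the property being invoked is the standard pumping lemma for regular languages (textbook statement, nothing URL-specific): every regular language L has a pumping length p such that

    $$\forall w \in L,\ |w| \ge p:\quad w = xyz \ \text{ with } \ |xy| \le p,\ |y| \ge 1,\ \text{ and } \ xy^iz \in L \ \text{ for all } i \ge 0.$$

    Languages like a^n b^n fail it, which is exactly the kind of construction the URL grammar doesn't appear to contain.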

  • (disco) in reply to Hanzo

    AFAIK URIs are a regular language. This guy says so, working from the BNF: http://cmsmcq.com/mib/?p=306

  • (disco) in reply to Keith

    http://regexlib.com/REDetails.aspx?regexp_id=2814

    :(6553[0-5]|655[0-2][0-9]\d|65[0-4](\d){2}|6[0-4](\d){3}|[1-5](\d){4}|[1-9](\d){0,3})

    T'interweb™. Never Short Of Idiots™.

  • (disco) in reply to pedantic_git
    pedantic_git:
    For starters, the string "aa.a.aa" matches this regexp and I'm pretty sure that's not a valid URL.

    Why isn't it a valid URL?

  • (disco) in reply to PJH
    PJH:
    T'interweb™. Never Short Of Idiots™.

    Quite similar to mine except this bit is wrong:

    655[0-2][0-9]\d
    

    and I'm not sure why you'd type this:

    [1-9](\d){0,3}
    

    instead of this:

    \d{0,4}
    

    Did we decide that 0 is invalid?

  • (disco) in reply to Dragnslcr
    Dragnslcr:
    Why isn't it a valid URL?

    NXDOMAIN? :laughing:

    [pjh@sofa ~]$ whois a.aa
    No whois server is known for this kind of object.
    [pjh@sofa ~]$ nslookup a.aa
    Server:         8.8.8.8
    Address:        8.8.8.8#53
    
    ** server can't find a.aa: NXDOMAIN
    
  • (disco) in reply to Keith
    Keith:
    and I'm not sure why you'd type this:
    [1-9](\d){0,3}
    

    instead of this:

    \d{0,4}
    

    One matches 0999, the other doesn't.


    @discoursebot - formatting in the quote!

  • (disco) in reply to PJH

    Am I not allowed to express my port as 00080?

  • (disco) in reply to Keith

    Why are we still discussing this? Stop discussing this @PJH.

  • (disco) in reply to Keith

    Validating a URL should use the same philosophy as validating an email: check if it exists.

    Fire off a request; if it's not a 404*, you're good to go.

    *probably other codes, too. I dunno, do I look like a web guy?
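
    A rough sketch of that philosophy (assumed Java; existsOnTheWire is a made-up name, and real code would also worry about redirects, timeouts, and all the non-404 failure modes that asterisk is waving at):

    import java.net.HttpURLConnection;
    import java.net.URL;

    class UrlProbe {
        // "Validate" a URL by actually requesting it.
        static boolean existsOnTheWire(String url) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD"); // only the status code matters
                conn.setConnectTimeout(5000);
                return conn.getResponseCode() != 404;
            } catch (Exception e) {
                return false; // malformed URL, wrong protocol, or unreachable host
            }
        }
    }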

  • (disco) in reply to Keith
    Keith:
    Am I not allowed to express my port as 00080?

    Not if what subsequently parses it presumes the leading 0 means something (octal, say), and decides that that particular thing isn't actually a number it recognises.

  • (disco)

    I'm not sure, it's kind of a tossup, but maybe that code is more reliable than this one I once saw somewhere (redone from memory):

    boolean validURL(String url)
    {
        return url.length() > 7 && url.substring(0,7).equals("http://");
    }
    
  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    it checks carefully for www followed by what it wants to be just a dot but is actually any character at the beginning, but allows as an alternative a sequence of alphanumerics followed by the very same dot that isn't just a dot. And of course, it overlooks the fact that www is a sequence of alphanumerics...

    There's no + after the alphanumeric character class. I'm not the most fluent regex-mangler in the world, and I think this is Java, which I've not been near for a while, so correct me if I'm wrong, but I believe it's precisely one alphanumeric character followed by (any character whatsoever, but they probably meant a dot).

    Of course since the . wasn't escaped ww would match, and the remaining w. would match the next part, so the special case is still redundant.

  • (disco) in reply to Keith
    Keith:
    Am I not allowed to express my port as 00080?

    Buzz

    Keith:
    Stop discussing this

    I think you're right.

  • (disco) in reply to Keith

    This regex still allows port 0.
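
    One hedged tweak, assuming the intent is "no port 0 and no leading zeros": tighten [0-5]\d{4} to [1-5]\d{4}, and replace \d{0,4} (which also matches the empty string) with [1-9]\d{0,3}:

    /^([1-9]\d{0,3}|[1-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d|6553[0-5])$/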

  • (disco)

    It's a bit disappointing to see a highly opinionated and badly informed article on this awesome site. Yeah, all of us have probably seen interesting URL and email validation, maybe even written one or two in our lives; it's just poorly written code, and as a joke, pretty much anticlimactic. Regexes are irreplaceable in some cases, and far from requiring cryptic spaghetti to solve the problem.

  • (disco) in reply to PJH

    That's silly. Maybe I'm using a URL validator to make sure that the new URL someone entered for a new deployment is a valid URL. There is a difference between an invalid URL and a valid URL that just doesn't currently resolve to anything.

  • (disco) in reply to uiron
    uiron:
    It's a bit disappointing to see a highly opinionated and badly informed article on this awesome site.

    :frystare:

    @antiquarian, I think we have another one for you.

  • (disco) in reply to boomzilla

    I've seen better.

  • (disco) in reply to uiron
    uiron:
    It's a bit disappointing to see a highly opinionated and badly informed article on this awesome site.

    This feels like a challenge. What can I get past the editors to get all the front page commenters worked up? :laughing: :trollface:

  • (disco) in reply to VinDuv
    VinDuv:
    In fairness, Discourse thinks it’s a valid URL, too.

    Well, that's it then. It must be valid; we all know how good Discourse is at parsing and validating stuff.</sarcasm>

  • (disco) in reply to mott555

    Redefining front-page trolling?

  • (disco) in reply to mott555

    Hah, was not expecting such backlash :) Was it really that rude? I genuinely apologize for the comment then, no excuses.

  • (disco) in reply to uiron
    uiron:
    it's just poorly written code, and as a joke, pretty much anticlimactic

    They can't all be winners.

    Sincerely, Jane Bailey

  • (disco) in reply to uiron

    I wasn't the author of this piece so I'm not offended, but I do like to troll. For the record, I enjoyed the backlash against the Hanzo articles.
