Admin
Hmmm, the article seems to have an incorrect assertion about one of the test cases.
www.google.com
actually matches, thanks to the ? after the http branch. Another WTF with this expression is the inconsistent escaping of dots, e.g. the regex matches
wwwagoogle.com
.
Admin
Even
google.com
matches, I think. This article is TRWTF.
Admin
Not to forget,
http://sites.google.com
works, too, because the
www.
is OR-ed with alphanum.
Admin
https://www.regex101.com/ reports 3 unmatched
/
delimiters as well.
Admin
So the bug is the opposite of that asserted in the article? That's not a URL and shouldn't validate as one.
Admin
Depending on the context, it could be a valid URL (you could buttume it's http like most browsers do nowadays).
My main gripe is that the article starts off with a nice list of things to be validated and then somehow gets those test results wrong.
Admin
Wait, did you actually type "buttume", or is it just a wordfilter of assume?
Admin
Your conclusion is TRWTF. I'm hearing, "This is an incorrect regex. Therefore, regex should not be used for this purpose."
If that line of logic made sense, I could find examples that proved that every programming language in the world was useless. I suppose it would make more sense to validate the URL using 25 MB of nested if-else statements?
Admin
Also not validating:
Admin
It's a clbuttic in-joke, look it up in the brillant meme wiki.
Edit: Hanzo'd. (That's another meme I'm now too lazy to find the source of)
Admin
All the problems people point out are symptoms of the main regex problem: they're all but impossible to read.
Admin
Well, that and they're only meant to parse regular languages... and pretty much nothing is a pure regular language...
Admin
The regex is buggy / has a lot of redundant info in it. I don't think a regex for a URL would be too difficult and I'd question the need for upper/lower case matching as I think the ignore case option could be used. If the author did not support the news protocol then not including it in the regex is not a bug. I wrote an http client. If someone passed in ftp:// I'd return an error (actually I'd just crash since I don't care :wink:).
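For what it's worth, a minimal sketch of that last point. The scheme fragment below is hypothetical, not the article's actual pattern; it just shows Java's CASE_INSENSITIVE flag doing the upper/lower-case work so the regex doesn't have to:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveScheme {
    public static void main(String[] args) {
        // Hypothetical scheme fragment: match it once, case-insensitively,
        // instead of spelling out [Hh][Tt][Tt][Pp] by hand.
        Pattern scheme = Pattern.compile("^(https?|ftp)://", Pattern.CASE_INSENSITIVE);
        System.out.println(scheme.matcher("HTTP://example.com").find()); // true
        System.out.println(scheme.matcher("ftp://example.com").find());  // true
        System.out.println(scheme.matcher("news://example.com").find()); // false - news deliberately unsupported
    }
}
```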
Admin
(1) So that they work correctly, that is.
Other oddities:
www
followed by what it wants to be just a dot but is actually any character at the beginning, but allows as an alternative a sequence of alphanumerics followed by the very same dot that isn't just a dot. And of course, it overlooks the fact that
www
is a sequence of alphanumerics...
-
in them, but the first is not.
g
isn't a valid hex digit, nor is it two of any kind of digit. (I'm willing to be corrected on this point - I don't have the time to hunt down whether %g is really valid in a URL, but it doesn't feel right.)
Overall, I'd give it about half a point out of 20, because the original author of this monstrosity at least pretended to try.
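On the %g point: per RFC 3986, a percent-escape is "%" followed by exactly two hex digits, so a stricter fragment (illustrative, not taken from the article's regex) would look like this:

```java
import java.util.regex.Pattern;

public class PercentEncodingCheck {
    public static void main(String[] args) {
        // RFC 3986: pct-encoded = "%" HEXDIG HEXDIG
        Pattern pctEncoded = Pattern.compile("%[0-9A-Fa-f]{2}");
        System.out.println(pctEncoded.matcher("%2F").matches()); // true
        System.out.println(pctEncoded.matcher("%g1").matches()); // false - 'g' is not a hex digit
        System.out.println(pctEncoded.matcher("%gg").matches()); // false
    }
}
```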
Admin
It's because the article uses a Java regex, not the format regex101 expects. (It expects it in the form /<regex>/<flags>, so, obviously, the forward slash would be a premature end of the regex.)
https://regex101.com/r/sW4zZ6/2
The following are matched, where the article claimed they aren't:
http://sites.google.com www.google.com http://google.com
That doesn't mean the regex is good, though; here are some more things that it considers valid: http://www,google.com - with a comma, because the regex uses a dot (match anything) rather than an escaped dot (which is a dot); and http://w,google.com - because it can also start with any alphanumeric, and then it has a dot rather than an escaped dot.
Yes, it does not need to match, depending on what URLs are supposed to be valid. Pointing it out as a mistake is like claiming file://C:\My Documents\MyPicture.jpg is supposed to be a valid image to put in a message. In fact, Firefox doesn't even want to resolve that.
Yes, URL parameters. The regex does require the trailing slash, so http://www.test.com/?pageid=123&testid=1524 matches, but that doesn't mean URL parameters are not supported.
They don't need to be allowed. It would be up to the requirements whether they could be there or not.
So, out of the 9 problems stated, 3 are incorrectly labeled as problems (valid URLs that the regex recognises as valid), one is completely made up (news:// doesn't need to be valid), one is a problem but the explanation why is completely off (URL parameters not being allowed vs an end slash being mandatory), and one is entirely dependent on assumed requirements that are not specified (basic credentials for FTP links). Which means there are only 3 entirely valid problems found. There were at least a couple of very glaring, very obvious issues that weren't found, either (www,).
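A quick way to see the unescaped-dot point above, using simplified stand-in patterns rather than the article's actual regex:

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    public static void main(String[] args) {
        // Simplified stand-ins, not the original pattern.
        String unescaped = "http://www.google.com";     // '.' here means "any character"
        String escaped   = "http://www\\.google\\.com"; // '\.' means a literal dot

        System.out.println(Pattern.matches(unescaped, "http://www,google.com")); // true  - the comma slips through
        System.out.println(Pattern.matches(escaped,   "http://www,google.com")); // false
        System.out.println(Pattern.matches(escaped,   "http://www.google.com")); // true
    }
}
```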
Admin
Thanks to all the ? and * after most of the groups, this regex matches all kinds of things that aren't URLs. For starters, the string "aa.a.aa" matches this regexp and I'm pretty sure that's not a valid URL.
Admin
I'd like to see the proof that a URL is not a regular language. It's simple: use the pumping lemma. You've got one hour.
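For anyone taking up the challenge, here is the standard statement of the pumping lemma for regular languages (quoted for reference, not from the article):

```latex
% Pumping lemma for regular languages:
% if L is regular, there exists a pumping length p >= 1 such that
\forall w \in L,\ |w| \ge p \implies
  \exists\, x, y, z:\ w = xyz,\ |xy| \le p,\ |y| \ge 1,\
  \text{and } xy^{i}z \in L \text{ for all } i \ge 0.
```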
Admin
.aa doesn't exist, yes. However, xy.i.de would be valid.
Admin
No, you use a parser for an LL or LR grammar... really, folks, parser combinators have been a thing for how long now?
Admin
In fairness, Discourse thinks it’s a valid URL, too.
You probably don’t want to check that with a regex, though...
Admin
Admin
Isn't that what you usually want though?
There's really no way to say what kind of validation is best if we don't know what the URLs are used for. If you need to retrieve something from a URL, it's usually best to just pass it to the library that does that and let it handle it.
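For example, a rough sketch leaning on java.net.URI instead of a hand-rolled regex (the class and method names here are just illustrative, not anything from the article):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSanityCheck {
    // Let java.net.URI do the syntax work, then ask for whatever the caller actually needs.
    static boolean looksLikeHttpUrl(String s) {
        try {
            URI uri = new URI(s);
            String scheme = uri.getScheme();
            // URI only validates generic syntax, so also require an http(s) scheme and a host.
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHttpUrl("http://www.test.com/?pageid=123&testid=1524"));
        System.out.println(looksLikeHttpUrl("not a url"));
    }
}
```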
Admin
I've had a go.
Admin
I tried to access http://www.⌘.ws/
[image]
Admin
That's reasonably readable (apart from the escaping you need). I'm still curious if anyone can prove URLs are not a regular language. I've never seen a construction like a^n b^n or x w x.
Admin
AFAIK URIs are a regular language. This guy says so: http://cmsmcq.com/mib/?p=306 BNF
Admin
http://regexlib.com/REDetails.aspx?regexp_id=2814
T'interweb™. Never Short Of Idiots™.
Admin
Why isn't it a valid URL?
Admin
Quite similar to mine except this bit is wrong:
and I'm not sure why you'd type this:
instead of this:
Did we decide that 0 is invalid?
Admin
NXDOMAIN
? :laughing:
Admin
One matches 0999, the other doesn't.
@discoursebot - formatting in the quote!
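The quoted patterns didn't survive the quote formatting, but here's a hypothetical pair that shows the same distinction:

```java
import java.util.regex.Pattern;

public class LeadingZeroPort {
    public static void main(String[] args) {
        // Two plausible ways to write a port fragment; only the second rejects leading zeros.
        System.out.println(Pattern.matches("[0-9]{1,5}", "0999"));      // true
        System.out.println(Pattern.matches("[1-9][0-9]{0,4}", "0999")); // false
    }
}
```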
Admin
Am I not allowed to express my port as 00080?
Admin
Why are we still discussing this? Stop discussing this @PJH.
Admin
Validating a URL should use the same philosophy as validating an email: check if it exists.
Fire off a request, and if it's not 404* you're good to go.
*probably other codes, too. I dunno, do I look like a web guy?
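A rough sketch of that approach (purely illustrative; treating any 2xx/3xx as "exists" is my guess at "probably other codes, too"):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlExistsCheck {
    // The "just ask the server" philosophy: a URL is valid if something answers for it.
    static boolean exists(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");   // don't bother downloading the body
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false;                    // bad syntax, DNS failure, timeout, ...
        }
    }

    public static void main(String[] args) {
        System.out.println(exists("http://www.google.com/"));
    }
}
```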
Admin
Not if what subsequently parses it presumes the leading 0 means something, and decides that that particular thing isn't actually a number it recognises.
Admin
I'm not sure, it's kind of a tossup, but maybe that code is more reliable than this, which I once saw somewhere (redone from memory):
Admin
There's no + after the alphanumeric character class. I'm not the most fluent regex-mangler in the world, and I think this is Java, which I've not been near for a while, so correct me if I'm wrong, but I believe it's precisely one alphanumeric character followed by (any character whatsoever, but they probably meant a dot).
Of course, since the . wasn't escaped, ww would match, and the remaining w. would match the next part, so the special case is still redundant.
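If I'm reading it the same way, here's a tiny check of that interpretation, with a stand-in for just that branch rather than the full regex:

```java
import java.util.regex.Pattern;

public class WwwBranchDemo {
    public static void main(String[] args) {
        // Hypothetical reading of the branch: one alphanumeric, then any character at all.
        Pattern branch = Pattern.compile("[a-zA-Z0-9].");
        System.out.println(branch.matcher("ww").matches()); // true - so the www. special case adds nothing
        System.out.println(branch.matcher("w,").matches()); // true - any character, not just a dot
    }
}
```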
Admin
Buzz
I think you're right.
Admin
This regex still allows port 0.
Admin
Well, at least nobody tried to parse HTML with a regular expression.
Admin
It's a bit disappointing to see a highly opinionated and badly informed article on this awesome site. Yeah, all of us have probably seen interesting URL and email validation, and maybe even written one or two in our lives; it's just poorly written code, and as a joke, pretty much anticlimactic. Regexes are irreplaceable in some cases, and far from requiring cryptic spaghetti to solve the problem.
Admin
That's silly. Maybe I'm using a URL validator to make sure that the new URL someone entered for a new deployment is a valid URL. There is a difference between an invalid URL and a valid URL that just doesn't currently resolve to anything.
Admin
:frystare:
@antiquarian, I think we have another one for you.
Admin
I've seen better.
Admin
This feels like a challenge. What can I get past the editors and do to get all the front page commenters worked up? :laughing: :trollface:
Admin
Well, that's it then. It must be valid; we all know how good Discourse is at parsing and validating stuff.</sarcasm>
Admin
Redefining front-page trolling?
Admin
Hah, I was not expecting such backlash :) Was it that rude? I genuinely apologize for the comment then, no excuses.
Admin
They can't all be winners.
Sincerely, Jane Bailey
Admin
I wasn't the author of this piece so I'm not offended, but I do like to troll. For the record, I enjoyed the backlash against the Hanzo articles.