Comment On Irregular Expression

I haven't posted a "WTF" Regular Expression before today because RegEx's are a fairly complex animal. Well, that and I haven't gotten a submission for one. They're incredibly useful tools for parsing and validation, but I think most (including myself) are glad that there are places like RegExLib.com around to save us from the intricacies of lazy quantifiers, backtracking, and lookbehinds.

Re: Irregular Expression

2005-02-15 14:02 • by White Knight
Damn and in my app I really needed to get Jesus's birthday verified in a reg -- oh well back to the drawing board

Re: Irregular Expression

2005-02-15 14:10 • by Jeff S
There ain't nothing regular about THAT

Re: Irregular Expression

2005-02-15 14:19 • by Stan Rogers
Well, 0001 would have to be allowed in order to squeeze a date in (AD
or CE, your choice) 1 into a system that assumes the current century if
no century value exists (I've actually had to code that sort of thing
for historical time-line entries). 0000 does not represent a year,
since the year before 0001 is 1 BC (with optional E). That being
said --- oy, ve!!!!



Even if the system had no native date support (is there one?), there
are easier (and more maintainable) ways of validating a formatted date
than RegEx. Okay -- test for and fail on "not-digit, not-virgule"
(deliberately not in character group format), or strip (replace with
nuthin') before continuing if desired -- but that should be about the
end of the game. As powerful as RegEx is, it is also nearly unreadable
at the best of times when taken in quantity. Any code monkey coming
behind can read (or learn how to read) something short like the
"not-digit, not-virgule" example, and can adjust allowable dates using
alternative methods (the Boss doesn't like April 7 -- ever). What
happens to the RegEx when the boss doesn't like April 7?
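
A minimal sketch of the approach described above, written in Python purely for illustration (the thread never shows the surrounding application code). It rejects anything that is not a digit or a slash, lets the standard library decide whether the date exists, and keeps the business rule in plain code. The MM/DD/YYYY order, the `BLOCKED_DAYS` set, and the function name are all assumptions.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical business rule: (month, day) pairs the boss never allows.
BLOCKED_DAYS = {(4, 7)}

def parse_request_date(text: str) -> Optional[date]:
    """Return a date for MM/DD/YYYY input, or None if it is not acceptable."""
    # Fail on anything that is not a digit or a virgule (slash).
    if any(ch not in "0123456789/" for ch in text):
        return None
    try:
        # strptime raises ValueError for impossible dates such as 2/30/2005.
        parsed = datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        return None
    # The boss's dislikes live here, not inside a regex.
    if (parsed.month, parsed.day) in BLOCKED_DAYS:
        return None
    return parsed
```

When the boss changes his mind about April 7, the only line that changes is the `BLOCKED_DAYS` set.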

Re: Irregular Expression

2005-02-15 15:04 • by
This is my all time favourite regexp:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Re: Irregular Expression

2005-02-15 15:07 • by Jeff S
I can see it now:

 

"Johnson!  Great job on that date validation code!  Works great.  We need a minor change, though -- it should only allow weekdays, no weekends.  Shouldn't be too hard with your programming expertise, huh?  Have it ready by tomorrow."

Re: Irregular Expression

2005-02-15 15:13 • by Stan Rogers
29728 in reply to 29726
Yes, indeed -- that parrot is deceased [:|] Thanks for the link. WOW!

Re: Irregular Expression

2005-02-15 15:19 • by
29729 in reply to 29726
:
This is my all time favourite regexp:



http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html







Blink Blink,,  Waaaaaaaaaaa[:'(]

Re: Irregular Expression

2005-02-15 15:26 • by Jeremy Morton

Obviously, the built in date validation routines should have been used; much easier to tell exactly what he's trying to do that way. If there weren't validation routines [and there were], the more verbose approach with if statements is a million times easier to comprehend, test, and change.


That said, I did read through the RegEx, and, on a first pass, it looks like it will do what I'm guessing was intended. As Matt says, even granting the context of using a RegEx for this, the mixture of non-capturing and capturing groupings when none are backreferenced [and none look reasonable to be backreferenced] is at least somewhat of a WTF. The weird year construction seems to be due to the fact that I don't know of a way to specify a multiple character string to not match in the middle of a RegEx that you are trying to match; he couldn't say "match any four digits in a row, except 0000." Of course, if year != "0000" is very easy and should be a clue to approach the problem in a different way.
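
For what it's worth, engines that support lookahead can say "any four digits except 0000" inline. A small sketch using Python's `re` module (chosen only for illustration; the names `YEAR` and `is_valid_year` are made up here):

```python
import re

# (?!0000) is a negative lookahead: the match fails if the next
# four characters are exactly 0000; otherwise \d{4} proceeds as usual.
YEAR = re.compile(r"(?!0000)\d{4}")

def is_valid_year(text: str) -> bool:
    """True for any four-digit string other than 0000."""
    return YEAR.fullmatch(text) is not None
```

That it can be done does not make it the better choice; the plain `year != "0000"` comparison is still easier to read.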

Re: Irregular Expression

2005-02-15 15:37 • by
...which ends up allowing any four digits EXCEPT 0000. But 0001 is valid.
That would be a good thing, 0000 isn't a valid date anyway.  Nor is 0,
or 00, or 000.  There is 1 BC, then 1 AD.  There is no year 0.

Re: Irregular Expression

2005-02-15 15:53 • by
29732 in reply to 29731
:
...which ends up allowing any four digits EXCEPT 0000. But 0001 is valid.
That would be a good thing, 0000 isn't a valid date anyway.  Nor is 0,
or 00, or 000.  There is 1 BC, then 1 AD.  There is no year 0.





The point was, this is all for a company Intranet site, and we
certainly haven't been in business long enough to worry about allowing
any numbers into a "request date" form before this millennium, let alone
two thousand years ago.

Re: Irregular Expression

2005-02-15 15:57 • by
29733 in reply to 29730
The built in validation routines are tightly nestled into system.web so you would not want to use them in a desktop app

Re: Irregular Expression

2005-02-15 16:40 • by
29737 in reply to 29733
The built in validation routines are tightly nestled into system.web so you would not want to use them in a desktop app




How odd - why not?  Breathes there a desktop
so finely tuned that its user has deleted all the system libraries that
aren't relevant to the precise setup?  Breathes there a language
programmer who hasn't read up on smart linking and dead code
elimination?  Surely including a few routines from one library
wouldn't also include ten million others unless they were directly
referenced?



(I ask from a position of unaccustomed ignorance here, since the
languages I use are smarter than this, but I realise it's possible
they're not normal.)



Incidentally, obWTF: {^(\d+)[-./](\d+)[-./](\d+)$}
should do the trick; anything further inside a regex is a sign of a
diseased mind.  Validate data in code, not in your regexes; that,
or wait for Perl 6, which looks like it finally fixes regexes permanently.
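
A rough sketch of "regex for the shape, code for the validation", in Python for illustration. The month/day/year order and the name `to_date` are assumptions; the pattern is the one above with the anchors replaced by `fullmatch`.

```python
import re
from datetime import date
from typing import Optional

# Three runs of digits separated by -, . or / (fullmatch stands in for ^...$).
SHAPE = re.compile(r"(\d+)[-./](\d+)[-./](\d+)")

def to_date(text: str) -> Optional[date]:
    """Split with the regex, then let ordinary code decide validity."""
    match = SHAPE.fullmatch(text)
    if match is None:
        return None
    try:
        month, day, year = (int(group) for group in match.groups())
        # date() itself rejects month 13, February 30, year 0, and so on.
        return date(year, month, day)
    except (ValueError, OverflowError):
        return None
```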

Re: Irregular Expression

2005-02-15 16:41 • by
29738 in reply to 29737
(Sigh... this forum software is really broken, you know that?  Anyone else seeing my last comment in Flyspeck 3pt?)

Re: Irregular Expression

2005-02-15 16:57 • by
29739 in reply to 29738
:
(Sigh... this forum software is really broken, you know
that?  Anyone else seeing my last comment in Flyspeck 3pt?)

Yep.

Re: Irregular Expression

2005-02-15 17:38 • by
29741 in reply to 29738

:
(Sigh... this forum software is really broken, you know that?  Anyone else seeing my last comment in Flyspeck 3pt?)


 


another WTF   

Re: Irregular Expression

2005-02-15 17:46 • by mjwills
29743 in reply to 29733





The built in validation routines are tightly nestled into system.web so you would not want to use them in a desktop app


 


There are other options, obviously. In VB.NET, IsDate would be an obvious choice (and C# has similar functionality).               

Re: Irregular Expression

2005-02-15 17:47 • by mjwills
29744 in reply to 29738

:
(Sigh... this forum software is really broken, you know that?  Anyone else seeing my last comment in Flyspeck 3pt?)


 


Evidently you have not learned what the word 'preview' means.

Re: Irregular Expression

2005-02-15 17:48 • by
29745 in reply to 29726
:
This is my all time favourite regexp:



http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html






For some reason, this one reminds me of BrainFuck. [:S]

Re: Irregular Expression

2005-02-15 18:59 • by
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html




Holy CRAP. At any point did it ever occur to that guy that maybe, just
maybe, a regex wasn't a good solution to that particular problem?



Re: Irregular Expression

2005-02-15 19:38 • by foxyshadis
29747 in reply to 29746
It's _fast_! You can't argue with fast! (It's a good argument that the
grammar of the address field is too complex. It's a good argument for
committal of the regex creator, too.)



One of the biggest changes to perl 6 will be turning regexes from an
increasingly overburdened and complicated grammar, into something more
context-based and . Sure it won't all fit into 400 characters of line
noise, but it also won't have to be recreated every time it needs to
change because even the author needs an hour to puzzle it out later.



btw, tinyanon, you should use \d{1,2}[./-]\d{1,2}[./-]\d{2,4} etc. Unless months with hundreds of days are now in vogue. ^_~

Re: Irregular Expression

2005-02-15 20:38 • by phx
29750 in reply to 29747

99% of the time you need to parse the date anyway, so just stuff it into DateTime.Parse and see if it complains ;) Faster than validating, then parsing XD

Re: Irregular Expression

2005-02-15 21:43 • by





It's _fast_! You can't argue with fast!




Actually I can argue with fast. Who cares how fast it is when nobody
other than the author can understand it? From my point of view its
effective performance is ZERO.



http://www.codinghorror.com/blog/archives/000185.html



code that makes sense is code which can be analyzed and maintained, and that
makes it performant.



Re: Irregular Expression

2005-02-15 23:02 • by bat
29756 in reply to 29744
Evidently you have not learned what the word 'preview' means.




Fug smucker, aren't you?




Yep - first time I've ever posted without previewing.  Goes to show.




Seems to be a bug relating to indented text.  Lemme check... 
Nope.  Maybe it was the italics at the start... Nope.  Must
be related to selecting a bit of text from a parent message and pasting
it in.  I could probably debug that...  Naah.




Re: Irregular Expression

2005-02-15 23:12 • by Blue
29760 in reply to 29756
Word, bat.



I still haven't figured it out either... But in all (or at least most
of that I can recall) the cases where it happened to me, I didn't copy,
paste, or quote anything that I can recall.   Just clicked
"Reply", typed in some text, and clicked "Post".  I see no need to
have to preview "trivial" postings just to make sure the forum software
(or more specifically, the edit control) didn't screw me in the
process...



Re: Irregular Expression

2005-02-15 23:14 • by Blue
29761 in reply to 29760
Actually, after thinking about it more, I did in fact cut and paste a
few times without remembering to paste into notepad and recopy. 
That must be related...



Ok, so I'm a doofus who should preview his posts.  Fair enuff.



Re: Irregular Expression

2005-02-15 23:21 • by
29762 in reply to 29760
Don't worry, the RSS gateway is a bit munted too -- not only can't you
see any of the comments, but every few days I mysteriously get a
duplicate copy of the last 11 posts, for no readily apparent reason.



(And I'm fairly sure it's not my reader, because I'm subscribed to a
number of feeds and this is the only one it happens to.)  Sigh.

Re: Irregular Expression

2005-02-16 00:01 • by MHaggag
29763 in reply to 29762
:
Don't worry, the RSS gateway is a bit munted too -- not only can't you
see any of the comments, but every few days I mysteriously get a
duplicate copy of the last 11 posts, for no readily apparent reason.



(And I'm fairly sure it's not my reader, because I'm subscribed to a
number of feeds and this is the only one it happens to.)  Sigh.



I get the duplicates too - Thunderbird.

Re: Irregular Expression

2005-02-16 03:11 • by
29765 in reply to 29747
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}


Better yet: \d{1,2}([./- ])\d{1,2}\1\d{2}\d{2}?

Which forces the date separator to be the same either side of the month, and only allows 2 or 4 digit years. Of course, neither solution restricts months or days to be valid.
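
A sketch of the same idea in Python's `re`, with two adjustments that are assumptions about the intent rather than anything the poster wrote. The hyphen is moved to the end of the character class, because `[./- ]` reads as a range from `/` down to space, which Python rejects as a bad range. The optional century is written `(?:\d{2})?`, because `\d{2}?` is a lazy quantifier that still demands two digits.

```python
import re

# Group 1 captures the first separator; \1 requires the second to be identical.
# The year is two digits, optionally followed by two more.
DATE_SHAPE = re.compile(r"\d{1,2}([./ -])\d{1,2}\1\d{2}(?:\d{2})?")

def has_date_shape(text: str) -> bool:
    """Check the shape only; months and days are not range-checked."""
    return DATE_SHAPE.fullmatch(text) is not None
```

As the post says, this still accepts nonsense like 99/99/9999; it only checks the layout.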

Re: Irregular Expression

2005-02-16 03:26 • by
29766 in reply to 29765
:
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}




Better yet: \d{1,2}([./- ])\d{1,2}\1\d{2}\d{2}?



Which forces the date separator to be the same either side of the
month, and only allows 2 or 4 digit years. Of course, neither solution
restricts months or days to be valid.




And of course, neither of those actually validates an international ISO standard date.

Re: Irregular Expression

2005-02-16 04:00 • by
29767 in reply to 29766
:
:
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}




Better yet: \d{1,2}([./- ])\d{1,2}\1\d{2}\d{2}?



Which forces the date separator to be the same either side of the
month, and only allows 2 or 4 digit years. Of course, neither solution
restricts months or days to be valid.




And of course, neither of those actually validates an international ISO standard date.



...which is exactly what?

Re: Irregular Expression

2005-02-16 04:20 • by
29768 in reply to 29753
:


Actually I can argue with fast. Who cares how fast it is when nobody
other than the author can understand it? From my point of view its
effective performance is ZERO.








If you download and read the source code for module
Mail::RFC822::Address you will notice that it is quite easy to read and
understand, presuming that you have some understanding of regular
expressions.  The big beast on that page is only for display
purposes.



Re: Irregular Expression

2005-02-16 04:39 • by tinoh
29769 in reply to 29767
:
:
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}

Better yet: \d{1,2}([./- ])\d{1,2}\1\d{2}\d{2}?

Which forces the date separator to be the same either side of the
month, and only allows 2 or 4 digit years. Of course, neither solution
restricts months or days to be valid.

And of course, neither of those actually validates an international ISO standard date.



...which is exactly what?


Here's a doc on the subject: http://www.cl.cam.ac.uk/~mgk25/iso-time.html


(I'm starting to understand the gripes about the forum software. Is it /really/ necessary to use a bleeping word processor to compose these posts? Not to mention one that doesn't even work in one of the popular alternatives to that other wtf that people use to infect their computers with spyware.</rant>)
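
For the ISO calendar-date form (YYYY-MM-DD) the linked document describes, no regex is needed at all in a language with a date library. A minimal Python sketch; `is_iso_date` is a name made up here, and which other ISO 8601 variants `date.fromisoformat` accepts depends on the Python version, so only the dashed form is relied on.

```python
from datetime import date

def is_iso_date(text: str) -> bool:
    """True if text parses as an ISO calendar date such as 2005-02-16."""
    try:
        date.fromisoformat(text)
    except ValueError:
        # Raised for wrong layouts and for impossible dates alike.
        return False
    return True
```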

Re: Irregular Expression

2005-02-16 04:41 • by
Now I don't know much about RegExp but does this code try and validate for 29th Feb only on leap years? If so, is it doing it properly (every 100 years it's not a leap year unless it's also a multiple of 400 see: http://www.codeproject.com/datetime/leap_year.asp) or just the 'is the year divisible by 4' rule?
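
The full rule asked about here is short when written as code rather than as a regex. A sketch in Python (the standard library's `calendar.isleap` does the same job; the function below just spells the rule out):

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

So 1900 is not a leap year under this rule, while 2000 is.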

Re: Irregular Expression

2005-02-16 05:27 • by KoFFiE
29771 in reply to 29747
foxyshadis:
It's _fast_! You can't argue with fast! (It's a good argument that the
grammar of the address field is too complex. It's a good argument for
committal of the regex creator, too.)

...


I can assure you, writing the check hardcoded will be a lot faster [:)] (at least if you're working in a compiled language and not an interpreted one where the regex is a native library; then it could come close, depending on the language)

Anyway - I hate regex in code, it's a script thing, a quick hack, a commandline tool, but please not in code... It's hell to debug or extend something like that.

For me it's the same as invoking a perl-interpreter to execute a small perl script because that particular thing is easier to write in perl, and sadly enough - I can't say I haven't seen such practices. Ok - the guy that did that was so nice to add a comment where he explained what the perl script did, but it was slight overkill to add a complete perl-installation to a windows-client program that was supposed to be "lightweight"... His argument was also "yeah but perl regex is fast"... [:@]

Re: Irregular Expression

2005-02-16 07:46 • by Irrelevant
that crazy address regex could be simplified greatly by splitting it up into sensible parts, like:

$mailto = qr/(?#...)/;

$http = qr/(?#...)/;

#...

$address = qr/$mailto|$http|(?#...)/o;

the /o on the end of the last one means it'll only be compiled once, so this should be no slower.



(and yes, I know it's only supposed to be an example of why you don't want to do it by regex)
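
The same splitting-into-named-parts idea carries over to other languages. A sketch in Python, where the sub-patterns are hypothetical and deliberately simplified (nowhere near RFC 822), just to show the composition:

```python
import re

# Hypothetical, simplified pieces; each can be read and tested on its own.
LOCAL_PART = r"[A-Za-z0-9._%+-]+"
DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

# Assembled and compiled once at import time, then reused for every check.
ADDRESS = re.compile(rf"{LOCAL_PART}@{DOMAIN}")

def looks_like_address(text: str) -> bool:
    """Rough shape check only, not a real address validator."""
    return ADDRESS.fullmatch(text) is not None
```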

Re: Irregular Expression

2005-02-16 08:24 • by
29776 in reply to 29770
:
Now I don't know much about RegExp but does this code
try and validate for 29th Feb only on leap years? If so, is it doing it
properly (every 100 years it's not a leap year unless it's also a
multiple of 400 see: http://www.codeproject.com/datetime/leap_year.asp) or just the 'is the year divisible by 4' rule?




Yep. It's been a little while since I broke the regex down and tried to
figure it out (and submitted it here) but I believe one of the three
main groups in the expression was devoted solely to that.





Re: Irregular Expression

2005-02-16 11:55 • by
I appreciate that this is kinda missing the point but there was no year "0000" but there was a year "0001" so that feature is kinda ok depending on whether we want dates going back that far...

Re: Irregular Expression

2005-02-16 14:27 • by logistix
29795 in reply to 29773
Irrelevant:
that crazy address regex could be simplified greatly by splitting it up into sensible parts, like:
$mailto = qr/(?#...)/;
$http = qr/(?#...)/;
#...
$address = qr/$mailto|$http|(?#...)/o;

the /o on the end of the last one means it'll only be compiled once, so this should be no slower.

(and yes, I know it's only supposed to be an example of why you don't want to do it by regex)


Actually this is the coup-de-taut from Mastering Regular Expressions, so it's supposed to be an example of RegEx zen.  I believe the point (if memory serves me correctly) is that it doesn't have any NFA-style rollbacks, so it's really fast and doesn't end up consuming a lot of memory in the process of NFA-to-DFA conversion by a regex compiler.

Re: Irregular Expression

2005-02-16 14:37 • by Drak
29800 in reply to 29795

I'm sure you mean a coup d'état. Okay I have no idea where that link is gonna go...


I'm all for using regular expressions instead of a series of 'instr' commands in VB.Net. Just don't make them too complex. It needs to be maintainable too.


Drak

Re: Irregular Expression

2005-02-16 14:46 • by Blue
29805 in reply to 29800
The main thing I hate about using regex in code is that you have to
escape (ie \", or even worse, escaping the \'s in the expression so
they become \\) so many of the metacharacters, etc, that it becomes a
nightmare to separate the actual regex expression from the mangling you
had to do to get it into a string variable.



Thank god C# allows the literal string construct (prefix with @), so it is no longer quite so bad for me.
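
Other languages have an equivalent of that construct. As an illustration, Python's raw strings (the `r` prefix) serve the same purpose; the sketch below writes one of the patterns from this thread both ways to show the difference. The names are invented for the example.

```python
import re

# The same pattern twice: escaped by hand, then as a raw string.
ESCAPED = "\\d{1,2}[./-]\\d{1,2}[./-]\\d{2,4}"
RAW = r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}"

# Either spelling compiles to the same regex; the raw one is what you'd paste.
MATCHER = re.compile(RAW)

def same_pattern() -> bool:
    """Confirm both spellings produce the identical string."""
    return ESCAPED == RAW
```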







Re: Irregular Expression

2005-02-16 17:02 • by JamesCurran
29823 in reply to 29800

Drak:
I'm sure you mean a coup d'état.


Actually, I think he meant coup de grâce.

Re: Irregular Expression

2005-02-16 18:23 • by Katja
29828 in reply to 29823
JamesCurran:








 Drak wrote:




I'm sure you mean a coup d'état.


Actually, I think he meant coup de grâce.



I prefer a coup soleil, though... [:P]

Re: Irregular Expression

2005-02-16 18:24 • by Stan Rogers
29829 in reply to 29823
I don't know about "coup de grace" either -- which is usually defined
as a mercy stroke, designed to kill a (usually) badly wounded foe who
would suffer unnecessarily otherwise. (A death blow given in other
contexts may be wrongly termed a coup de grace in English, but it
misses the whole "grace" part of the deal.) Coup d'état, a sudden,
violent overthrow of the government, is definitely wrong. Coup de génie
(stroke of genius) may fit, but it's hardly a common find in English,
as would chef d'oeuvre (masterpiece). The most probable fit for
French-originated-but-common-in-English phrases would be "tour de
force"; the effect it has on you may be likened to a "coup de foudre".

Re: Irregular Expression

2005-02-17 11:07 • by sas
29882 in reply to 29829
I think he didn't know what he meant. [N]

Re: Irregular Expression

2005-02-17 14:32 • by
29893 in reply to 29773
Irrelevant:
that crazy address regex could be simplified greatly by splitting it up into sensible parts, like:

$mailto = qr/(?#...)/;

$http = qr/(?#...)/;

#...

$address = qr/$mailto|$http|(?#...)/o;

the /o on the end of the last one means it'll only be compiled once, so this should be no slower.



No need for the /o.  The great thing about qr// is that it precompiles regexe(s|n)... :)

Re: Irregular Expression

2005-02-17 16:48 • by foxyshadis
I believe the point of the forum software is to get you in the mood for a proper appreciation of the collection of wtfs.



I didn't know ISO supported leaving the dashes and colons out. Nice,
the Exslt and XPath specs never go over that, and probably don't
support the full 'standard'.



Why the hell would anyone include a perl binary/installer with a
compiled project? wtf? PCRE exists for a reason, and will definitely be
much faster than marshalling arguments into perl scripts, calling perl,
and (sometimes) getting the results back. Just call it all from C/C++
and be happy.

Re: Irregular Expression

2005-02-18 06:43 • by Drak
29962 in reply to 29829

Stan Rogers:
I don't know about coup de grace ... Coup d'état... Coup de génie ... chef d'oeuvre ... "tour de force"; ... "coup de foudre".


Perhaps he just meant 'Coup':


coup (kōō)
n. pl. coups (kōōz)

  1. A brilliantly executed stratagem; a triumph.
  2. a. A coup d'état.
     b. A sudden appropriation of leadership or power; a takeover: a boardroom coup.
  3. Among certain Native American peoples, a feat of bravery performed in battle, especially the touching of an enemy's body without causing injury.

Sorry, I overlooked the fact that d'etat wasn't in there for definition 1 the first time round[:S]


Drak

Re: Irregular Expression

2005-02-20 06:22 • by
Re

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html



I once worked at one of the first companies to offer design-your-own
egreet.  Well, it was a pay service, and the folks using it
weren't too hip to tech, and we really really wanted people to get the
messages (so we'd have happy customers).



I was tasked with writing an email address validation routine that would specify exactly what was wrong with the address.



It checked parts, it validated domain names, etc.



It would return messages like "the domain name (after the @ sign) must not begin with a numeral."



It took about 3 days and was about 400 lines of VBScript. 



I'm not sure if it actually sold any more cards...



It wasn't until much later that I realized a parser was probably already available (though doubtfully in VBScript).

Re: Irregular Expression

2005-02-21 06:06 • by bat
30063 in reply to 30047
I saw the ex-parrot URL and
then read the next line as being about a company that offered a
"design-your-own egret".  I skimmed the rest of the comment
looking for other references to birds, assuming this was some meme new
to the blogosphere and before I knew it I'd be knee-deep in obscure
species of winged creature.  All your geese are belong to us?



Topic?  What topic?


Re: Irregular Expression

2005-02-21 08:46 • by JamesCurran
30066 in reply to 30047

:
Re
It would return messages like "the domain name (after the @ sign) must not begin with a numeral."


There is, in fact, nothing wrong with a domain name starting with a numeral.
