The Daily WTF: Curious Perversions in Information Technology

2007-02-19 Reply Admin

AssimilatedByBorg:
*Sigh*
I wish I had a nickel for every time someone asked me to write code to validate email addresses, and thought it was simple.

I gently try to explain the incredible formats of addresses that are actually valid, and that eventually, they will really annoy somebody by using their restrictive idea of an email address.

The smart ones understand that.

It's the other kind of people that I don't know what to do with. The kind of people who ask, "why did your validation routine let through an email address with a typo in it?" (Seriously. This has happened.)

Oh, please. What kind of idiot can't write a validation routine that sends an email, then checks the user's email account to see if they got it?

2007-02-19 Reply Admin

Who does quoted local parts? Plenty of people. I parsed most of the incoming email for a major tech company's email support once. While most messages didn't use quoted local parts there was still a significant portion that did.

2007-02-19 Reply Admin

mathew:
The only validation of e-mail addresses I do is to check it matches .+@.+\..+
i.e. there's an @, there are characters on both sides of the @, there's at least one . to the right of the @, there are characters on both sides of the .

Remember, no syntactic check is going to determine whether an e-mail address is actually valid and working. All you're checking for is obvious brokenness like putting in localhost-specific user IDs, or putting their name in the field instead of their e-mail address.

Hooray! Someone gets it!
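
For the curious, mathew's one-line sanity check could be sketched in Python roughly like this. The function name `looks_like_email` is made up for illustration, and `re.fullmatch` stands in for whatever anchoring his own code uses. As he says, this only catches obvious brokenness; it says nothing about whether mail to the address would arrive.

```python
import re

# mathew's pattern: something, an @, something, a dot, something.
# Compiled once at import time; fullmatch() anchors it at both ends.
SANITY_CHECK = re.compile(r".+@.+\..+")


def looks_like_email(candidate: str) -> bool:
    """Return True if the string passes the loose 'obvious brokenness' check."""
    return SANITY_CHECK.fullmatch(candidate) is not None


print(looks_like_email("user@example.com"))  # ordinary address
print(looks_like_email("Bob Smith"))         # a name typed into the email field
print(looks_like_email("root@localhost"))    # localhost-style ID, no dot after the @
```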

2007-02-19 Reply Admin

I understand the complexities involved in validating email addresses. But why in God's name do websites tell me that my phone number is invalid or doesn't match my zip code? My area code has been around for about 8 years, when Verizon decided to inconvenience everyone forever instead of a smaller number of people for a short time by beginning the practice of overlays.

2007-02-20 Reply Admin

Asd:
After reading this I had a look at what the jakarta commons validator did. And found a nice mini WTF. The EmailValidator is a singleton despite it not having any state. God I wish they had never invented that pattern.
http://svn.apache.org/viewvc/jakarta/commons/proper/validator/trunk/src/main/java/org/apache/commons/validator/EmailValidator.java?view=markup

Most likely somebody initially had a singleton because they were compiling the regexes when the class was loaded. If you're validating lots of email addresses, you can save a load of time by compiling (and possibly optimizing) the regexes.

Unfortunately somebody else probably then decided that they should just "fix" it so that it compiles the regexes whenever they're used, but then either didn't think to undo the singleton aspect, or it was too late because changing the interface would break code.
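
To illustrate the point (in Python rather than Java, and not taken from the commons validator source): a stateless validator can compile its regex once when the class is defined, without needing a singleton at all. The class name and the deliberately loose pattern below are assumptions made up for this sketch.

```python
import re


class EmailValidator:
    """Stateless validator: the regex is compiled once, when the class is defined."""

    # Hypothetical loose pattern, present only so there is something to compile.
    _PATTERN = re.compile(r"[^@\s]+@[^@\s]+")

    @classmethod
    def is_valid(cls, candidate: str) -> bool:
        # No instance and no singleton needed: there is no per-object state.
        return cls._PATTERN.fullmatch(candidate) is not None


print(EmailValidator.is_valid("someone@example.com"))
```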

2007-02-20 Reply Admin

As far as emails being recursive... I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

2007-02-20 Reply Admin

Um - the people who might use + in the local part?

You don't need to quote an address containing a + character. And in fact the RFC says an email address should not contain characters requiring quoting. And you should not use quoting unless the address contains chars requiring it. Thus the RFC and a bit of logic imply that quoting should not be used.

And considering how tricky it is to correctly parse an SMTP command with a quoted email address, I'd say that is one of the things I'd not have a problem with rejecting. The original RFC even allowed newline and \0 in an email address as long as they were quoted and escaped.

lanzz · 2007-02-20 Reply Admin

Anonymous Tart:
Yeah, I use something much simpler even than that in internal code.
!/^$/

why not just /./ ? or even better, avoid regex altogether and use the strlen()-equivalent.

lanzz · 2007-02-20 Reply Admin

Bill:
imMute:
That regex was not written by a human, it was compiled using probably Parser::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does.

so you disregard all code-generating tools (bison/yacc for example), because the code they produce is usually much harder to maintain than the source definitions they take? java bytecode is unmaintainable compared to the source code that produced it, but there are still java VMs instead of interpreters that would run the plain java-language source code.

regular expressions are truly a mess and are not easy to maintain, but their strength is not in writing a 3kb regular expression that you will never be able to change, but in using much shorter regexes (short in metachars, literal matches usually don't degrade readability). for example, if you need to match an identifier, it is usually easiest to write /^[a-z][a-z0-9]*$/i, and this regex is much easier to maintain than the code needed to match without regex. parsing the whole RFC definition of an email address purely in regex is a meaningless exercise in cleverness. it is comparable to the IOCCC, not an argument why regular expressions are bad or why the RFC is insane - the complete syntax for email addresses is so hard to parse in regex mainly because it was not designed to be parsed in regex.

regexes are simply over-used for parses too complex for clean implementation in regex.

2007-02-20 Reply Admin

TheD:
PJH:
I can count on the hands of one arm...

For some reason, this was hilarious to me. Maybe I need more coffee?

:D ;P

And I have all the coffee I need in my blood

2007-02-20 Reply Admin

Email validation Regexes are like arseholes... everybody's got one but nobody wants to see the next man's.

The RFC should come with its own RegEx.

lanzz · 2007-02-20 Reply Admin

thrashaholic:
Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.

why then is there a mechanism to catch exceptions? even more, catch SPECIFIC exceptions? unless you want to catch them because you plan for them?

lanzz · 2007-02-20 Reply Admin

BruteForce:
As far as emails being recursive... I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

could be, but regular expressions are not actually functions.

2007-02-20 Reply Admin

Gabe:
Most likely somebody initially had a singleton because they were compiling the regexes when the class was loaded. If you're validating lots of email addresses, you can save a load of time by compiling (and possibly optimizing) the regexes.
Unfortunately somebody else probably then decided that they should just "fix" it so that it compiles the regexes whenever they're used, but then either didn't think to undo the singleton aspect, or it was too late because changing the interface would break code.

I thought that too, but it is like that as far back as it goes in SVN.

2007-02-20 Reply Admin

anon:
woohoo:
I beg your pardon?
there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.

They're both in JavaMail (specifically, javax.mail.internet).

I stand corrected, you are right of course for JEE, but I did in fact think of JSE. sorry for being too vague...

2007-02-20 Reply Admin

if i did not screw up entirely, the regexp at the end of the article does not cover addresses like this:

[email protected]

a friend of mine has an email-address like this (don't ask me why... i guess "stuff" was already taken).

it is perfectly valid, yet many websites and even ms outlook 2000 (or was it an older version of outlook express?) complained.

GeneWitch · 2007-02-20 Reply Admin

http://imgs.xkcd.com/comics/regular_expressions.png

hehehehehehehehehehehehe woot

2007-02-20 Reply Admin

Well, it's actually called an "ampersat" so I don't think that's it :) In my opinion, the WTF is that his 20-some lines of code do pretty much the same thing as:

/^[^@ ,]+@[^ ,]+.[^ ,]*$/

Which is a bit shorter... and would allow "?@?." as a valid e-mail address :)

2007-02-20 Reply Admin

El Quberto:
Suggan:
Actually, most validation algorithms disapprove of this perfectly valid address: me@se Why is that??

I believe because all TLD consist of 2,3 or 4 characters and so "se" would be considered a TLD. Is it? I dunno. Even if it was one you can't just email that directly, it would be like emailing "me@com"

Except that many country level TLDs do have MX records for themselves, normally for domain administrators, obviously.

I've seen bug reports requesting MTAs and MUAs fix themselves to transmit mail to/from these addresses.

EvanED · 2007-02-20 Reply Admin

BruteForce:
As far as emails being recursive... I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.

Yes, this is true. The reverse is also true.

However, regular expressions can't express everything that iterative functions can produce. For instance, it's not hard to produce an iterative function that will figure out if the parens in an expression are balanced, but it's provably impossible to write a regular expression to do so. (Unless your regular expression engine accepts things that aren't really regular expressions.)
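
A minimal sketch of such an iterative function, in Python (the name `parens_balanced` is just for illustration). A single counter is enough, and that unbounded counter is exactly what a true regular expression lacks:

```python
def parens_balanced(expression: str) -> bool:
    """Check that every '(' has a matching ')' using one loop and one counter."""
    depth = 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared before its opening partner.
                return False
    # Balanced only if every opened paren was closed again.
    return depth == 0


print(parens_balanced("(a (b) (c (d)))"))
print(parens_balanced("(a (b)"))
```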

2007-02-20 Reply Admin

Email addresses are not all defined by RFCs. There's a world beyond RFC 2822 you know, and if you have to deal with European government legacies, then you might well encounter it.

This isn't to say that such an address is reachable from teh intawebs, and almost certainly not reachable in that format. They still exist though, and they may still be a personal identifier.

2007-02-20 Reply Admin

skington:
In the same way that ages ago people used to shorten variables and eliminate white space to fit more code into 32K, or however much RAM their machine had at the time. It doesn't mean you develop that way.

Uh, eliminate white space and shorten variables to fit into 32K? Please (oh please), you must be talking about interpreted languages like BASIC.

10 PRINT "What's your name?"
20 INPUT A$
30 PRINT "Hello "; A$
40 PRINT "Would you like to play a game?"
50 INPUT B$
60 PRINT "Yes, we will play thermonuclear warfare."

Ah, the memories.

CAPTCHA: pirates (Arrrrggh)

richardchaven · 2007-02-20 Reply Admin

sort it topologicaly to check for a circular dependencies!:
Janek:
This is how I do it in Java
try {
    InternetAddress foo = new InternetAddress(emailCandidate);
} catch (AddressException ex) {
    return false;
}
return true;

How do we know she's a witch? Let's build a bridge of her.

Thank you. That is possibly the funniest geeks-only reference I have ever read.

Cheers

2007-02-20 Reply Admin

I don't work for them but I do think their software is amazing. For all of you giving out about the use of Regular Expressions and the fact that they are impossible to figure out, take a look at RegExBuddy (http://www.regexbuddy.com/) and you will never look back. It helps you to create Regular Expressions for all of us who think it is a black art. I used to but no longer.

2007-02-20 Reply Admin

You say ampersat, I say amphora.

captcha: wigwam (wtf?)

kmactane · 2007-02-20 Reply Admin

Otto:
Buzz:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. Jamie Zawinski

Some people, when confronted with regular expressions, like to quote jwz. Now they are fools who cannot cope with regular expressions.

I wish I could mod your comment up.

2007-02-20 Reply Admin

The plus-sign? One-letter names? Please. I'd be happy if web sites would just accept the damn .name TLD. It's been in use for three or four years now. Dammit!

2007-02-20 Reply Admin

Dingbat:
Email addresses are _not_ all defined by RFCs. There's a world beyond RFC 2822 you know, and if you have to deal with European government legacies, then you might well encounter it.
This isn't to say that such an address is reachable from teh intawebs, and almost certainly not reachable in that format. They still exist though, and they may still be a personal identifier.

Sure they are, at least in the context of the internet. If you want to hook up your legacy network to the internet, it's your problem translating the address, not the internet's.

real_aardvark · 2007-02-20 Reply Admin

Toby:
I don't work for them but I do think their software is amazing. For all of you giving out about the use of Regular Expressions and the fact that they are impossible to figure out, take a look at RegExBuddy (http://www.regexbuddy.com/) and you will never look back. It helps you to create Regular Expressions for all of us who think it is a black art. I used to but no longer.

Toby: Retreat from the Dark Side.

Just because you no longer think that it is a black art any more, that doesn't make the statement any less true.

Any damn moron can write a regex. Any damn moron armed with RegexBuddy can write one that works, as of yesterday. (What with RegexBuddy being fed yesterday's information.) Today, maybe. Tomorrow, the World! Except not.

Regexes, whilst they have their place, are inherently fragile. They depend upon using strings as your basic data structure, which unless you're a VB programmer, a Java programmer, or an utter moron, is unlikely to be your ideal choice of representation. Other than an ancient IBM mini from somewhere back in the '60s, the name of which I forget, computers do not think in terms of strings.

So, by all means, use regexes for simple tasks. Validating an email address is not a simple task. Nor is it, in fact, very useful. May I quote a wise man from further up this thread:

Zygo:
Kalle:
The easiest and most likely to succeed way to validate an address is to establish an SMTP session to the primary MX of the domain and do an RCPT. If the address is invalid, either you cannot establish a connection or the SMTP server returns an error. Easy :)
[And yes, I do know that the Internet mail doesn't work like that any more, more is the pity.]

For those who haven't tried it, there are four cases:

<snip> ... look up the details for yourself. Taught me a thing or two. <snip/>

And then again:

Bat:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Some people, when faced with a regular expression, think "I know, I'll use Jamie Zawinski as an excuse and cherish my ignorance". Now they've got an infinite number of problems.

Then again, there are idiots like "Bat." Y'know, batso, it is in fact possible to understand regular expressions -- even to use them now and again, as appropriate -- without being ignorant. Or, indeed, using anything as an "excuse." I believe the concept is called "design choice." Obviously either Darwin or God made a bad mistake in your case.

And, to:

richardchaven:
sort it topologicaly to check for a circular dependencies!:
How do we know she's a witch? Let's build a bridge of her.

Thank you. That is possibly the funniest geeks-only reference I have ever read.

Cheers

... No, that's just Monty Python. Standard pick-at-yer-acne stuff. If you really want a geeky reference to bridges, see:

http://www.creativyst.com/Doc/Articles/Mgt/AgileBridges/AgileBridges.htm

2007-02-20 Reply Admin

woohoo:
Janek:
This is how I do it in Java
try {
    InternetAddress foo = new InternetAddress(emailCandidate);
} catch (AddressException ex) {
    return false;
}
return true;

I beg your pardon?

there are no classes 'InternetAddress' and 'AddressException' that I know of in the Java standard libraries.

there is a class 'InetAddress' with two subclasses 'Inet4Address' and 'Inet6Address' (for obvious reasons), but these are only usable for IP addresses, not for the full mail address scheme.

if these should be home-grown utility classes (and you do have control over it), it would be preferable to have a boolean 'isValid()' method in lieu of having to use exception handling for the control flow.

No, but there are in the JavaMail API (http://java.sun.com/products/javamail/).

Of course, you could have used Google...

2007-02-21 Reply Admin

Bill:
imMute:
That regex was not written by a human, it was compiled using probably Parser::RecDescent or some other module

Possibly, but matters not. The fact remains that it's unmaintainable as-is. Just because the metadata that "Documents" it might be maintained elsewhere, such as a tool, doesn't mitigate the fact that no one reading the source can be sure of what it does. Also, if the tool were worth a damn, it would also give you comments to imbed along with the regex.

Hopefully this WAS simply the output of a builder class, where the method calls used to build it provide adequate documentation. But based on the OP, I doubt it.

As the author of the regexp on http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html, a few comments:

I did not write it by hand. Further, it does not appear in the code in that form. I wrote it by translating the components of the syntax in RFC822 into regexp components, producing around 25 lines of code that map very directly to the RFC822 EBNF spec. The full regexp is compiled (through the magic of string interpolation) when the module is run.

The form that it appears in the code (go read it, it's not that bad) is perfectly maintainable by anyone with reasonable knowledge of regexps and a copy of the RFC822 EBNF to hand (and if you don't have the latter, you shouldn't be writing email address validators). None of the component regular expression assignments are longer than an 80 char line, and that's including descriptive variable names drawn from the original grammar.

I wrote it because (a) I can't stand incorrect validation - either get it right or don't do it at all and (b) I found that using regular expressions actually works better in Perl than doing it "properly" using Parse::RecDescent.

The only reason I include it on the web page in its full 4k horror is to make people understand that any significantly shorter regexp is unlikely to be complete.

In response to an earlier comment: The reason that it can't cope with comments is because RFC822 allows comments to be arbitrarily nested and there's simply no way to cope with that in a regexp. The Perl module recursively applies a regular expression in order to strip out comments before validating the remainder.

If you're interested in proving the shortcomings of some of the shorter regexps, the test script in that module contains a decent set of weird addresses, and could easily be pointed at a different regexp (credit to the author of the RecDescent validator for most of it).

Paul
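
The comment-stripping step Paul describes can be sketched in Python along these lines. This is not the code from his Perl module: the names are invented for illustration, and the sketch ignores quoted strings and backslash-escaped parentheses, which a real RFC822 parser would have to respect. The idea is to delete innermost comments over and over until none are left.

```python
import re

# Matches a comment that contains no further parentheses, i.e. an innermost one.
INNERMOST_COMMENT = re.compile(r"\([^()]*\)")


def strip_comments(address: str) -> str:
    """Remove arbitrarily nested (comments) by repeatedly deleting the innermost ones."""
    while True:
        address, replaced = INNERMOST_COMMENT.subn("", address)
        if replaced == 0:
            # Nothing left to strip; the remainder can go to the real validator.
            return address


print(strip_comments("user(a (nested) comment)@example.com"))
```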

MrBester · 2007-02-21 Reply Admin

And the biggest WTF of all the comments is that they've completely forgotten that this was supposed to be a JavaScript solution...

2007-02-21 Reply Admin

El Quberto:
Suggan:
Actually, most validation algorithms disapprove of this perfectly valid address: me@se Why is that??

I believe because all TLD consist of 2,3 or 4 characters and so "se" would be considered a TLD. Is it? I dunno. Even if it was one you can't just email that directly, it would be like emailing "me@com"

I think .museum is a valid TLD.

2007-02-21 Reply Admin

I just use 'something @ something dot something' :

/.@..*/

...or something to that effect.

captcha: 'bathe' - spookily, I just have

2007-02-21 Reply Admin

Thank god my computational linguistics teacher didn't make us convert that to an NFA.

shiver

Captha: gotcha (it rhymes)

volodya · 2007-02-22 Reply Admin

Have I missed something, or would this JavaScript mess up on [email protected], thinking that it was a valid address?

2007-02-22 Reply Admin

I don't suppose you're reading the comments because otherwise you would have fixed your regexp by now -- but I find it truly pathetic how you derive so much amusement out of making fun of less informed people while messing it up royally yourself.

It was already mentioned that the local part of an e-mail address can validly end in a ".". I would like to add that it is also perfectly valid to have consecutive dashes in the domain name (read Internationalized domain name on Wikipedia).

By posting your broken regexp you are perpetuating the same annoyance that you are ridiculing.

2007-02-22 Reply Admin

It was the end of page two of the comments before I found why anyone would WANT to validate an email address beyond two simple requirements:

it's not so hosed it does something bizarre and destructive
you can get them to send a confirmation mail.

What are you all up to that it's so important anyway? I saw this one in a forum & think I might keep it myself, and it validates fine.

[email protected]

Otherwise <alphanumericgibberish>@mytrashmail.com validates too. AND allows a validation email, unless the server is particularly snotty.

2007-02-23 Reply Admin

lanzz:
thrashaholic:
Exceptions are to be used for EXCEPTIONAL CASES that you can not plan for.
why then is there a mechanism to catch exceptions? even more, catch SPECIFIC exceptions? unless you want to catch them because you plan for them?

Not thinking of language mechanisms, there are two kinds of exceptions. The first is the alternative flow in a use case, i.e. well documented and testable, something you are aware of while coding. The second category is plain programming errors: situations that leave your program in a mangled state, something you never anticipated. In the second case, the only reasonable thing to do is to abort the program (or restart it after some error message); continuing a program with unknown state is never a good idea.

So the question of when to use exceptions in a language depends on the support for them. In C++ it's impossible to write exception-safe code, for two reasons:

It takes lots of discipline to write exception-safe code, i.e. code that does not leak resources. Even if you can do it, the next guy maintaining your code will probably make some mistakes.
C++ has no composite exceptions; if something goes wrong during a stack unwind on a throw, you can only abort. In other words, an exception may never leave a destructor.

So in C++ exceptions can only be used as an abort trap (the programming error case, giving you a chance to log the error before calling abort).

However, in a managed language such as C# or Python, exceptions are an excellent flow control mechanism for handling the alternative flow of a use case. They often really simplify code, and the performance hit is a non-issue, because they usually require user input to resolve the problem.
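
As a small Python illustration of the "alternative flow" case (the function name `parse_port` and the port-number scenario are invented for this sketch), the exception is anticipated, caught specifically, and turned into an ordinary result:

```python
from typing import Optional


def parse_port(text: str) -> Optional[int]:
    """Return the port number, or None if the user typed something unusable."""
    try:
        port = int(text)
    except ValueError:
        # Expected alternative flow: the input was not a number at all.
        return None
    # A number outside the valid port range is also treated as unusable.
    return port if 0 < port < 65536 else None


print(parse_port("8080"))
print(parse_port("eighty"))
```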

2007-02-23 Reply Admin

Several people have said that as the standard for email addresses is recursive then there is no way to write a regular expression for it. Given that email addresses have a maximum length, can a regexp be used even though the standard is recursive? For example, there can only be a maximum of 127 full stops in the domain part.

Lemma: Any finite language is regular. Proof: S -> w_1 | w_2 | ... snip some 2^127 rules ... | w_2^127
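
The lemma can be demonstrated on a small scale in Python. The sketch below uses balanced parentheses as a stand-in for a recursive grammar (an assumption for illustration, not the email grammar itself): once the length is capped, the language is finite, and a plain alternation of every member matches it exactly. All names are hypothetical.

```python
import re
from itertools import product


def is_balanced(s: str) -> bool:
    """Reference check for balanced parentheses (input contains only '(' and ')')."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


def bounded_balanced_regex(max_len: int):
    """Build w_1 | w_2 | ... for every non-empty balanced string up to max_len."""
    words = [
        "".join(chars)
        for n in range(2, max_len + 1, 2)      # balanced strings have even length
        for chars in product("()", repeat=n)
        if is_balanced("".join(chars))
    ]
    return re.compile("|".join(re.escape(w) for w in words))


PATTERN = bounded_balanced_regex(6)
print(PATTERN.fullmatch("(()())") is not None)    # within the length cap
print(PATTERN.fullmatch("(((())))") is not None)  # balanced, but longer than the cap
```

The catch is the same as in the proof: the number of alternatives grows exponentially with the length cap.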

2007-02-24 Reply Admin

I took a stab at this once. Only good for .NET.

http://www.twilightsoul.com/Domains/Voyager/Patterns/EmailAddresses/tabid/134/Default.aspx

With a pretty full explanation of what I did and did not include from the RFC 2822.

2007-02-25 Reply Admin

I see a problem in these comments (and on the Internets as well), in that people are using the word "valid" so casually that it becomes void of meaning. Before you test an email address for validity, you need to carefully define "validity". The reason why some people in these comments think "me@se" is valid and others do not could only be that the word "valid" means different things to them.

If you're going to use the string entered as "EMAIL" as the recipient in an outgoing email, then the definition of "valid" should surely be "usable as recipient", should it not?

And here's the important part: notice how "usable as recipient" is only LOOSELY related to "strictly follows RFCs 2/822". Before you attempt to validate the email field, ask yourself: are you absolutely 100% certain about what is "usable as recipient" when you're sending out mail? No? Then why pretend you can "validate"?

If you're going to check for "RFC 2822 conformity", write "RFC 2822 COMPLIANT STRING" in the on-line help; don't write "EMAIL". And then make pretty damn sure your validator is RFC 2822 compliant.

For comparison: What about the "NAME" field - would you "validate" that according to some "must contain at least two parts, be capitalized" scheme (maybe there's an RFC, even?), or just allow anything that's "usable as name"?

Or the "PHONE NUMBER"? When Geörge Lucäs enters 1-900-STARWARS, wouldn't it be fun if that validated, since your dial-up marketers actually can use that when placing a callback to see if the knitted mittens were to the customer's liking?

2007-02-25 Reply Admin

Here is a much easier to read RFC822 compliant validating Perl 5.9.5/PCRE 7 regex courtesy of Abigail:

my $email_address = qr {
   (?(DEFINE)
     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                                            (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)
     (?<address_list>    (?&address) (?: , (?&address))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)
}x;

2007-02-25 Reply Admin

Craig:
Several people have said that as the standard for email addresses is recursive then there is no way to write a regular expression for it. Given that email addresses have a maximum length, can a regexp be used even though the standard is recursive? For example, there can only be a maximum of 127 full stops in the domain part.

Strictly speaking no, since regular expressions as defined in mathematics/computer science cannot match a recursive pattern.

However, the most commonly used "regex" engines are not actually true "regular expression" engines, and therefore /CAN/ match recursive patterns. PCRE has had this for some time, and Perl5 has had it for even longer using "dynamic patterns" and in Perl 5.9.5 you also have "recursive patterns" as well.

A good rule of thumb is if an engine is documented to do "leftmost-first" matching (the first alternative that succeeds wins) then it isn't a true regular expression engine, and therefore hypothetically /can/ match a recursive pattern. Whereas if it is documented as using a DFA or NFA simulating DFA or documented to provide longest-token matching semantics then it will NOT be able to match recursive patterns.

True regular expressions make doing things like backreferences, capturing, lookaround, etc much more difficult (or perhaps impossible) than doing so with the backtracking engines commonly found in programming languages, although true regular expressions have /much/ better worst case performance than the kind you will find in Perl, Python, Java, PCRE, etc. (OTOH Perl and friends probably have better best cases.) All of these engines use backtracking-nfa's as compared to true dfa or dfa simulation. This is for a good reason, in a programming language you can typically avoid the worst case by careful pattern construction, whereas the utility of true regular expression engines is far reduced from that which a backtracking implementation can provide.

TCL has a hybrid engine, and other projects are also doing work in implementing hybrid schemes so as to avoid the worst case performance when possible.

2007-03-01 Reply Admin

I fully agree and I wrote more or less the same rant (without code) in French :

http://www.bortzmeyer.org/arreter-d-interdire-des-adresses-legales.html

2010-10-05 Reply Admin

This solution does not work for addresses like- [email protected]

where there is a "." just before @ sign.

Has anyone found a workaround for this?

Thanks, Arunraj

nairarunraj · 2010-10-05 Reply Admin

Janek:
This is how I do it in Java
try {
    InternetAddress foo = new InternetAddress(emailCandidate);
} catch (AddressException ex) {
    return false;
}
return true;

This solution does not work for addresses like- [email protected]

where there is a "." just before @ sign.

Has anyone found a workaround for this?

Thanks, Arunraj

Validating Email Addresses