Admin
Who does quoted local parts? Plenty of people. I parsed most of the incoming email for a major tech company's email support once. While most messages didn't use quoted local parts there was still a significant portion that did.
Admin
Hooray! Someone gets it!
Admin
I understand the complexities involved in validating email addresses. But why in God's name do websites tell me that my phone number is invalid or doesn't match my zip code? My area code has been around for about 8 years, when Verizon decided to inconvenience everyone forever instead of a smaller number of people for a short time by beginning the practice of overlays.
Admin
Most likely somebody initially had a singleton because they were compiling the regexes when the class was loaded. If you're validating lots of email addresses, you can save a load of time by compiling (and possibly optimizing) the regexes.
Unfortunately somebody else probably then decided that they should just "fix" it so that it compiles the regexes whenever they're used, but then either didn't think to undo the singleton aspect, or it was too late because changing the interface would break code.
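As a rough illustration of the compile-once idea described above, here is a minimal Python sketch. The pattern and the names `_EMAIL_RE` and `looks_like_email` are hypothetical and deliberately loose; this is not the regex or the class from the article.

```python
import re

# Hypothetical, deliberately loose pattern used only for illustration.
# It is compiled once, when the module is imported, and then reused.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+$")


def looks_like_email(address: str) -> bool:
    """Check an address against the precompiled pattern (no recompilation per call)."""
    return _EMAIL_RE.match(address) is not None
```

Python's `re` module already caches recently compiled patterns, so the saving there is smaller than in languages where every compile call does the full work.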
Admin
As far as emails being recursive... I read somewhere that all recursive functions can be rewritten as iterative functions, and they pointed to a scientific proof thereof.
Admin
And considering how tricky it is to correctly parse an SMTP command with a quoted email address, I'd say that is one of the things I'd not have a problem with rejecting. The original RFC even allowed newline and \0 in an email address as long as they were quoted and escaped.
Admin
why not just /./ ? or even better, avoid regex altogether and use the strlen()-equivalent.
Admin
so you disregard all code-generating tools (bison/yacc for example), because the code they produce is usually much harder to maintain than the source definitions they take? java bytecode is unmaintainable compared to the source code that produced it, but there are still java VMs instead of interpreters that would run the plain java-language source code.
regular expressions are truly a mess and are not easy to maintain, but their strength is not in writing a 3kb regular expression that you will never be able to change, but in using much shorter regexes (short in metachars, literal matches usually don't degrade readability). for example, if you need to match an identifier, it is usually easiest to write /^[a-z][a-z0-9]*$/i, and this regex is much easier to maintain than the code needed to match without regex. parsing the whole RFC definition of an email address purely in regex is a meaningless exercise in cleverness. it is comparable to the IOCCC, not an argument why regular expressions are bad or why the RFC is insane - the complete syntax for email addresses is so hard to parse in regex mainly because it was not designed to be parsed in regex.
regexes are simply over-used for parsing tasks too complex for a clean implementation in regex.
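To make the identifier comparison concrete, here is a small Python sketch that puts the short regex next to a hand-written equivalent. The function names are made up for illustration, and `re.ASCII` is an added assumption so that both versions agree on plain ASCII letters.

```python
import re
import string

# The short regex from the comment above; re.ASCII keeps the
# case-insensitive match restricted to plain ASCII letters.
IDENTIFIER_RE = re.compile(r"^[a-z][a-z0-9]*$", re.IGNORECASE | re.ASCII)


def is_identifier(text: str) -> bool:
    """Regex version: a letter followed by letters or digits."""
    return IDENTIFIER_RE.fullmatch(text) is not None


def is_identifier_manual(text: str) -> bool:
    """Hand-written version of the same rule, for comparison."""
    if not text or text[0] not in string.ascii_letters:
        return False
    allowed = string.ascii_letters + string.digits
    return all(ch in allowed for ch in text[1:])
```

Both functions accept `foo42` and reject `42foo`; the regex version states the rule in one line.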
Admin
:D ;P
And I have all the coffee I need in my blood
Admin
Email validation Regexes are like arseholes... everybody's got one but nobody wants to see the next man's.
The RFC should come with its own RegEx.
Admin
I stand corrected, you are right of course for JEE, but I did in fact think of JSE. sorry for being too vague...
Admin
if i did not screw up entirely, the regexp at the end of the article does not cover addresses like this:
[email protected]
a friend of mine has an email-address like this (don't ask me why... i guess "stuff" was already taken).
it is perfectly valid, yet many websites and even ms outlook 2000 (or was it an older version of outlook express?) complained.
Admin
http://imgs.xkcd.com/comics/regular_expressions.png
hehehehehehehehehehehehe woot
Admin
Well, it's actually called an "ampersat" so I don't think that's it :) In my opinion, the WTF is that his 20-some lines of code do pretty much the same thing as:
/^[^@ ,]+@[^ ,]+.[^ ,]*$/
Which is a bit shorter... and would allow "?@?." as a valid e-mail address :)
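For anyone who wants to poke at that one-liner, here is a quick Python sketch using the pattern exactly as it appears in the comment. Note that the dot is unescaped as posted, so it matches any character; whether that was intended is not clear from the thread. The names `LOOSE_RE` and `passes_loose_check` are hypothetical.

```python
import re

# Pattern copied as posted; the "." is unescaped and so matches any character.
LOOSE_RE = re.compile(r"^[^@ ,]+@[^ ,]+.[^ ,]*$")


def passes_loose_check(address: str) -> bool:
    """Return True if the address gets through the loose one-line pattern."""
    return LOOSE_RE.match(address) is not None
```

With this pattern `?@?.` passes, as the comment says, and so does `a@bc`, while `a@b` is rejected only because nothing is left for the dot to match.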
Admin
I've seen bug reports requesting MTAs and MUAs fix themselves to transmit mail to/from these addresses.
Admin
Yes, this is true. The reverse is also true.
However, regular expressions can't express everything that iterative functions can produce. For instance, it's not hard to produce an iterative function that will figure out if the parens in an expression are balanced, but it's provably impossible to write a regular expression to do so. (Unless your regular expression engine accepts things that aren't really regular expressions.)
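The iterative paren check mentioned above might look something like this in Python. It is only a sketch: the function name is made up, and it looks at round parentheses only, ignoring quoting or escapes.

```python
def parens_balanced(expression: str) -> bool:
    """Iteratively check that every ")" closes an earlier "("."""
    depth = 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its matching "("
                return False
    return depth == 0  # anything left open means unbalanced
```

The counter is the part a true regular expression cannot do, since the nesting depth is unbounded.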
Admin
Email addresses are not all defined by RFCs. There's a world beyond RFC 2822 you know, and if you have to deal with European government legacies, then you might well encounter it.
This isn't to say that such an address is reachable from teh intawebs, and almost certainly not reachable in that format. They still exist though, and they may still be a personal identifier.
Admin
Uh, eliminate white space and shorten variables to fit into 32K? Please (oh please), you must be talking about interpreted languages like BASIC.
10 PRINT "What's your name?"
20 INPUT A$
30 PRINT "Hello "; A$
40 PRINT "Would you like to play a game?"
50 INPUT B$
60 PRINT "Yes, we will play thermonuclear warfare."
Ah, the memories.
CAPTCHA: pirates (Arrrrggh)
Admin
Thank you. That is possibly the funniest geeks-only reference I have ever read.
Cheers
Admin
I don't work for them but I do think their software is amazing. For all of you giving out about the use of Regular Expressions and the fact that they are impossible to figure out, take a look at RegExBuddy (http://www.regexbuddy.com/) and you will never look back. It helps you to create Regular Expressions for all of us who think it is a black art. I used to but no longer.
Admin
You say ampersat, I say amphora.
captcha: wigwam (wtf?)
Admin
I wish I could mod your comment up.
Admin
The plus-sign? One-letter names? Please. I'd be happy if web sites would just accept the damn .name TLD. It's been in use for three or four years now. Dammit!
Admin
Sure they are, at least in the context of the internet. If you want to hook up your legacy network to the internet, it's your problem translating the address, not the internet's.
Admin
Toby: Retreat from the Dark Side.
Just because you no longer think that it is a black art any more, that doesn't make the statement any less true.
Any damn moron can write a regex. Any damn moron armed with RegexBuddy can write one that works, as of yesterday. (What with RegexBuddy being fed yesterday's information.) Today, maybe. Tomorrow, the World! Except not.
Regexes, whilst they have their place, are inherently fragile. They depend upon using strings as your basic data structure, which unless you're a VB programmer, a Java programmer, or an utter moron, is unlikely to be your ideal choice of representation. Other than an ancient IBM mini from somewhere back in the '60s, the name of which I forget, computers do not think in terms of strings.
So, by all means, use regexes for simple tasks. Validating an email address is not a simple task. Nor is it, in fact, very useful. May I quote a wise man from further up this thread:
And then again:
Then again, there are idiots like "Bat." Y'know, batso, it is in fact possible to understand regular expressions -- even to use them now and again, as appropriate -- without being ignorant. Or, indeed, using anything as an "excuse." I believe the concept is called "design choice." Obviously either Darwin or God made a bad mistake in your case.
And, to:
... No, that's just Monty Python. Standard pick-at-yer-acne stuff. If you really want a geeky reference to bridges, see:
http://www.creativyst.com/Doc/Articles/Mgt/AgileBridges/AgileBridges.htm
Admin
No, but there are in the JavaMail API (http://java.sun.com/products/javamail/).
Of course, you could have used Google...
Admin
As the author of the regexp on http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html, a few comments:
I did not write it by hand. Further, it does not appear in the code in that form. I wrote it by translating the components of the syntax in RFC822 into regexp components, producing around 25 lines of code that map very directly to the RFC822 EBNF spec. The full regexp is compiled (through the magic of string interpolation) when the module is run.
The form that it appears in the code (go read it, it's not that bad) is perfectly maintainable by anyone with reasonable knowledge of regexps and a copy of the RFC822 EBNF to hand (and if you don't have the latter, you shouldn't be writing email address validators). None of the component regular expression assignments are longer than an 80 char line, and that's including descriptive variable names drawn from the original grammar.
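For readers who have not seen the technique, here is a rough Python sketch of building a larger pattern out of small named components via string interpolation. The components are a heavily simplified, hypothetical subset (no quoted strings, comments, or domain literals); this is not the module's actual code and not a complete RFC 822 grammar.

```python
import re

# Simplified, hypothetical components named after grammar rules.
atom = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
dot_atom = rf"{atom}(?:\.{atom})*"   # atoms separated by single dots
local_part = dot_atom
domain = dot_atom
addr_spec = rf"{local_part}@{domain}"

# The full pattern only exists once the pieces are interpolated together.
ADDR_SPEC_RE = re.compile(addr_spec)


def matches_addr_spec(address: str) -> bool:
    """Return True if the whole string matches the simplified addr-spec."""
    return ADDR_SPEC_RE.fullmatch(address) is not None
```

Each assignment stays short and readable even though the interpolated result is much longer.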
I wrote it because (a) I can't stand incorrect validation - either get it right or don't do it at all and (b) I found that using regular expressions actually works better in Perl than doing it "properly" using Parse::RecDescent.
The only reason I include it on the web page in its full 4k horror is to make people understand that any significantly shorter regexp is unlikely to be complete.
In response to an earlier comment: The reason that it can't cope with comments is because RFC822 allows comments to be arbitrarily nested and there's simply no way to cope with that in a regexp. The Perl module recursively applies a regular expression in order to strip out comments before validating the remainder.
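The idea of repeatedly applying a regular expression to peel off nested comments can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the Perl module's code: it ignores quoted strings and backslash escapes, and the names are hypothetical.

```python
import re

# Matches only an innermost comment: parentheses with no parentheses inside.
INNERMOST_COMMENT_RE = re.compile(r"\([^()]*\)")


def strip_comments(text: str) -> str:
    """Remove innermost comments repeatedly until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST_COMMENT_RE.sub("", text)
    return text
```

Each pass removes one level of nesting, so `user(a (nested) comment)@example.com` needs two passes before only `user@example.com` is left.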
If you're interested in proving the shortcomings of some of the shorter regexps, the test script in that module contains a decent set of weird addresses, and could easily be pointed at a different regexp (credit to the author of the RecDescent validator for most of it).
Paul
Admin
And the biggest WTF of all the comments is that they've completely forgotten that this was supposed to be a JavaScript solution...
Admin
I think .museum is a valid TLD.
Admin
I just use 'something @ something dot something' :
/.@..*/
...or something to that effect.
captcha: 'bathe' - spookily, I just have
Admin
Thank god my computational linguistics teacher didn't make us convert that to an NFA.
shiver
Captcha: gotcha (it rhymes)
Admin
Have I missed something, or would this JavaScript mess up on [email protected], thinking that it was a valid address?
Admin
I don't suppose you're reading the comments because otherwise you would have fixed your regexp by now -- but I find it truly pathetic how you derive so much amusement out of making fun of less informed people while messing it up royally yourself.
It was already mentioned that the local part of an e-mail address can validly end in a ".". I would like to add that it is also perfectly valid to have consecutive dashes in the domain name (read Internationalized domain name on Wikipedia).
By posting your broken regexp you are perpetuating the same annoyance that you are ridiculing.
Admin
It was the end of page two of the comments before I found why anyone would WANT to validate an email address beyond two simple requirements:
What are you all up to that it's so important anyway? I saw this one in a forum & think I might keep it myself, and it validates fine.
[email protected]
Otherwise <alphanumericgibberish>@mytrashmail.com validates too. AND allows a validation email, unless the server is particularly snotty.
Admin
Not thinking of language mechanisms, there are two kinds of exceptions. The first is the alternative flow in a use case, i.e. well documented and testable, something you are aware of while coding. The second category is plain programming errors: situations that leave your program in a mangled state, something you never anticipated. In the second case, the only reasonable thing to do is to abort the program (or restart it after some error message); continuing a program with unknown state is never a good idea.
So the question of when to use exceptions in a language depends on the support for them. In C++ it's impossible to write exception-safe code for two reasons:
So in C++ exceptions can only be used as an abort trap (the programming error case, giving you a chance to log the error before calling abort).
However, in a managed language such as C# or Python, exceptions are an excellent flow control mechanism for handling the alternative flow of a use case. They often really simplify code and the performance hit is a non-issue, because they usually require user input to resolve the problem.
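A minimal Python sketch of the "alternative flow" case: bad input raises, and the caller simply moves on to the next attempt. The function names and the age example are hypothetical, and the list of answers stands in for a user being asked again.

```python
from typing import Iterable, Optional


def parse_age(raw: str) -> int:
    """Parse a non-negative age; raise ValueError for anything else."""
    age = int(raw)  # int() raises ValueError on non-numeric text
    if age < 0:
        raise ValueError("age must be non-negative")
    return age


def first_valid_age(answers: Iterable[str]) -> Optional[int]:
    """Try each answer in turn, as if re-prompting the user after a bad one."""
    for raw in answers:
        try:
            return parse_age(raw)
        except ValueError:
            continue  # alternative flow: this answer was unusable, ask again
    return None  # ran out of answers without a usable one
```

Calling `first_valid_age(["abc", "-3", "42"])` skips the two bad answers and returns 42.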
Admin
I took a stab at this once. Only good for .NET.
http://www.twilightsoul.com/Domains/Voyager/Patterns/EmailAddresses/tabid/134/Default.aspx
With a pretty full explanation of what I did and did not include from the RFC 2822.
Admin
I see a problem in these comments (and on the Internets as well), in that people are using the word "valid" so casually that it becomes void of meaning. Before you test an email address for validity, you need to carefully define "validity". The reason why some people in these comments think "me@se" is valid and others do not could only be that the word "valid" means different things to them.
If you're going to use the string entered as "EMAIL" as the recipient in an outgoing email, then the definition of "valid" should surely be "usable as recipient", should it not?
And here's the important part: notice how "usable as recipient" is only LOOSELY related to "strictly follows RFCs 2/822". Before you attempt to validate the email field, ask yourself: are you absolutely 100% certain about what is "usable as recipient" when you're sending out mail? No? Then why pretend you can "validate"?
If you're going to check for "RFC 2822 conformity", write "RFC 2822 COMPLIANT STRING" in the on-line help; don't write "EMAIL". And then make pretty damn sure your validator is RFC 2822 compliant.
For comparison: What about the "NAME" field - would you "validate" that according to some "must contain at least two parts, be capitalized" scheme (maybe there's an RFC, even?), or just allow anything that's "usable as name"?
Or the "PHONE NUMBER"? When Geörge Lucäs enters 1-900-STARWARS, wouldn't it be fun if that validated, since your dial-up marketers actually can use that when placing a callback to see if the knitted mittens were to the customer's liking?
Admin
Here is a much easier to read RFC822 compliant validating Perl 5.9.5/PCRE 7 regex courtesy of Abigail:
Admin
Strictly speaking no, since regular expressions as defined in mathematics/computer science cannot match a recursive pattern.
However, the most commonly used "regex" engines are not actually true "regular expression" engines, and therefore /CAN/ match recursive patterns. PCRE has had this for some time, and Perl5 has had it for even longer using "dynamic patterns" and in Perl 5.9.5 you also have "recursive patterns" as well.
A good rule of thumb is if an engine is documented to do "leftmost-longest" matching then it isn't a true regular expression engine, and therefore hypothetically /can/ match a recursive pattern. Whereas if it is documented as using a DFA or NFA simulating DFA or documented to provide longest-token matching semantics then it will NOT be able to match recursive patterns.
True regular expressions make doing things like backreferences, capturing, lookaround, etc much more difficult (or perhaps impossible) than doing so with the backtracking engines commonly found in programming languages, although true regular expressions have /much/ better worst case performance than the kind you will find in Perl, Python, Java, PCRE, etc. (OTOH Perl and friends probably have better best cases.) All of these engines use backtracking NFAs rather than a true DFA or DFA simulation. This is for a good reason: in a programming language you can typically avoid the worst case by careful pattern construction, whereas a true regular expression engine offers far less utility than a backtracking implementation can provide.
TCL has a hybrid engine, and other projects are also doing work in implementing hybrid schemes so as to avoid the worst case performance when possible.
Admin
I fully agree and I wrote more or less the same rant (without code) in French :
http://www.bortzmeyer.org/arreter-d-interdire-des-adresses-legales.html
Admin
This solution does not work for addresses like- [email protected]
where there is a "." just before @ sign.
Has anyone found a workaround for this?
Thanks, Arunraj