Admin
At least they used clear variable names.
Admin
Using a regex for tasks like this will almost always be clearer and more robust than doing multiple splits, counting array lengths, and rejoining with the correct syntax.
Unless your developers never bothered to learn basic regex.
Admin
Well, humour me and give us the correct RegEx for this task. (I bothered to learn basic regex, but I suck at it.)
Admin
Cannot agree more. It has been nearly 15 years since a book basically told me to buck up and learn regular expressions, because that is how pros manage string data. I accepted a "friendly challenge" to use nothing but regexes for string manipulation over the course of 30 days. I have never looked back. Regexes are only confusing if you do not understand them. Yes, regexes are code, and they need good unit tests like any code to make sure they are correct. But a single regex can replace tens or hundreds of lines of splits and loops. They are now my go-to tool for string manipulation and searching.
Admin
I love that external parameter called "howmanydots". I wonder why the programmer didn't implement it as a variable inside the function.
Admin
When I have interviewed candidates for Java programmer roles, competence with regex was always a requirement. I recommend https://regex101.com/quiz to Remy. :-)
Admin
I always use try/catch blocks when performing incredibly complex tasks like [checks notes] extracting a sub-string.
Admin
Let's see any of the proposed solutions find the correct domain for news.bbc.co.uk. Good luck.
Admin
Wouldn’t a substring check also solve this, and more clearly than split or regexp?
Admin
While regexes should be used where appropriate, in this case they may not, unless you are using an enhanced kind. There are two operations required - to split the address into its local and domain parts, and to split the domain part into its labels. The first task is merely finding the last '@' character, which is perfectly suited to a regex, though there may be a simpler alternative (e.g. python has rfind and rindex). The second needs to return a variable number of items, which many regex implementations cannot do, so it may be better to use a function that splits a string at a delimiter, if your language has such a thing.
And don't forget that you generally shouldn't implement this sort of thing yourself: use a library function. Otherwise you'll probably get it wrong. In the case of email addresses, many people would look for the first '@', which is wrong because that is a valid character in the local part of an email address. Test your implementation on this:
"!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"@example.
Yes, that has all possible characters in the local part, and the domain ends in a dot - both perfectly valid.
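The two steps described above (split at the last '@', then split the domain into labels) can be sketched in Python. This is only an illustration, not a validator: the function name `split_address` is made up here, and dropping empty labels is an assumption about how a trailing dot should be treated.

```python
def split_address(address):
    """Split an address at its last '@', then split the domain into labels."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' in address")
    # Assumption: drop empty labels, such as the one a trailing dot produces.
    labels = [label for label in domain.split(".") if label]
    return local, labels


local_part, labels = split_address('"a@b"@mail.example.com.')
print(local_part, labels)
```

Using `rpartition` rather than a regex keeps the "variable number of labels" part in an ordinary list, which is the point the comment makes about splitting on a delimiter.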
Admin
“Using a regex for tasks like this will almost always be clearer….” Honestly, I disagree. Regular expressions were not designed for legibility and expressiveness.
Admin
Email addresses are not basic regex.
Regex tends to be write-only unless you use them regularly (pun not intended,) so I try to avoid them unless it is a simple task.
Here is my (badly broken) attempt at getting the domain from an email address.
.@(..)?(?<domain>.+..+)((.*)$)?
Admin
It seems I should expand: regexes are complicated and perform like hot garbage. When your task is manipulating a string based on a set of single character markers, a regex is like using a semi truck to haul a lemon to your door. It certainly works, and if you already own the semi truck, then it doesn't seem that much of a problem, but it doesn't exactly feel like the right tool for the job.
I'm not anti-regex (and I'm quite the expert at using regexes in the find/replace box of my editor to fix my terrible code), but regexes should always be the last tool you reach for in code.
(I just wrote some tokenizer code using regexes last night because it was late and I was lazy, but regexes were absolutely the wrong tool)
Admin
me no use regex but then me no big brain dev. me grog-brain dev. still me think this better than any regex
#!/bin/env ruby
domain = "[email protected]".split(".")[-2..-1]
Admin
The real WTF is the misunderstanding of DNS, in particular the fact that an effective TLD is not necessarily the very last label of the domain name.
Is it really the intention of the code that microsoft.co.uk and apple.co.uk should be the same ".co.uk" domain?
Admin
Call me slow today, but isn't the real WTF wanting foo.bar.com and bar.com to be recognised as the same domain? They are different and can have almost no relation to each other.
Admin
Effective TLDs that contain a dot, like .co.uk, mean that the problem is not really solvable by simple text manipulation only. Luckily adding in some check against public suffix list shouldn't be too hard.
Admin
No, look again. This is what howmanydots is for.
Admin
They didn't know about the Public Suffix List.
Admin
That quiz site actually says exactly what CAN be wrong about regex. I use regex all the time in linux with sed/awk. And that site tells me I'm wrong on my regex submissions to the quiz because I don't use EXACTLY the syntax or EXACTLY the solution they want you to use. Syntax being a large part of it.
That's why regex can be hard. There are about a dozen or more implementations to deal with in the world. What needs to be escaped? What doesn't? Is the replacement grouping using \1 or $1? etc...
There is no ONE TRUE REGEX(tm)...
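As one concrete data point on the \1 versus $1 question: Python's `re` module uses backslash references in replacement strings and treats `$1` as plain text. A small sketch, with a made-up input string:

```python
import re

# Backslash-style and named-style group references both work in Python.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")
swapped_g = re.sub(r"(\w+)@(\w+)", r"\g<2>@\g<1>", "user@host")

# Dollar-style references are not interpreted; they are inserted literally.
literal = re.sub(r"(\w+)@(\w+)", "$2@$1", "user@host")

print(swapped, swapped_g, literal)
```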
Admin
Oh man, the \1 vs $1 drives me up the wall. So many times I hit enter and change my code to loads of \1 strings everywhere.
Admin
IndexOf vs. Regex seems like a red herring. Isn't there a library to do this?
Admin
This logic kind of falls apart when you have domains like company.co.uk
Admin
I love regex's, but the howManyDots requirement significantly complicates the regex. You'd end up with something like
$emailAddress =~ m/@(([^\.]+\.){0,$howManyDots}$[^\.]+)/;
And that doesn't even work for howManyDots = 0.
Admin
Ack. Messed up the dollar sign placement.
$emailAddress =~ m/@(([^\.]+\.){0,$howManyDots}[^\.]+)$/;
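For comparison, here is a rough Python rendering of the same idea. It is a sketch under assumptions, not a drop-in translation: the name `domain_tail` is invented, and the pattern is anchored only at the end of the domain (not directly after the '@'), so that it keeps the last labels of a longer domain and still behaves when the count is 0.

```python
import re
from typing import Optional


def domain_tail(email: str, how_many_dots: int) -> Optional[str]:
    """Return the last how_many_dots + 1 labels of the domain, or fewer if it is shorter."""
    domain = email.rpartition("@")[2]
    # Up to how_many_dots "label." groups, then a final label, anchored at the end.
    pattern = r"((?:[^.@]+\.){0,%d}[^.@]+)$" % how_many_dots
    match = re.search(pattern, domain)
    return match.group(1) if match else None


print(domain_tail("user@abc.def.ghi.com", 1))
```

Because `re.search` returns the leftmost match, the result always starts on a label boundary. A domain that ends in a dot does not match this pattern and yields `None`.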
Admin
Nobody noticed how the example abc.def.gih.com needs to be dissected two different ways according to the comment?
Admin
It's worse than that, actually, because there are a few legacy .uk domains that go directly onto .uk without an intervening "flavour" modifier (co, edu, gov, etc.). The primary examples are "parliament.uk" and "nhs.uk".
And France has a similar issue, where the "effective TLD" also isn't a fixed number of words. Companies go directly on .fr (leroymerlin.fr), while government departments go on .gouv.fr (impots.gouv.fr)
Admin
This looks like C# to me. Here's a one-liner:
return string.Join(".", new Regex(@"^.*@(?<Part>[^.]+)(\.(?<Part>[^.]+))*$").Match(email).Groups["Part"].Captures.Cast<Capture>().Reverse().Take(howManyDots + 1).Reverse().Select(c=>c.Value));
This is not tested, but should be correct or very close to being correct.
Admin
Or a little better:
static readonly Regex _rgxEmailDomainParts = new Regex(@"^.*@(?<Part>[^.]+)(\.(?<Part>[^.]+))*$");
var captures = _rgxEmailDomainParts.Match(email).Groups["Part"].Captures;
var list = new List<string>();
int take = Math.Min(howManyDots + 1, captures.Count); // howManyDots dots means howManyDots + 1 labels
for(int i = captures.Count - take; i < captures.Count; ++i) list.Add(captures[i].Value); // I think I got the addition/subtraction right? don't quote me on this :)
return string.Join(".", list);
Admin
This may be a perfectly legal email address, but come on, man.
Admin
And now here in Australia, we have opened up '.au', so you can have 'company.au' or 'company.com.au'.
Admin
Most efficient version, without regex:
Admin
Regexes (and for that matter, string splitting) won't work. As others have mentioned, some top level domains have subdomains that need to be taken into account.
For example: [email protected] and [email protected] are both valid, but you certainly don't want to confuse bobindustries.govt.nz with aliceindustries.govt.nz.
The only way to solve this is to lookup a list of valid public top-level domains. Any other solution is going to be wrong for a relatively large number of users.
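A minimal sketch of what such a lookup could look like in Python. The suffix set below is a tiny hand-picked sample for illustration only; a real implementation would load the full Public Suffix List and also handle its wildcard and exception rules, which this sketch ignores. The names `SAMPLE_SUFFIXES` and `registrable_domain` are made up.

```python
# Hypothetical sample; NOT the real Public Suffix List.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "nz", "govt.nz", "fr", "gouv.fr"}


def registrable_domain(domain, suffixes=SAMPLE_SUFFIXES):
    """Return the public suffix plus one more label, or None if there is none."""
    labels = domain.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the whole name is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


for name in ("news.bbc.co.uk", "parliament.uk", "impots.gouv.fr"):
    print(name, "->", registrable_domain(name))
```

Note that the number of labels kept varies with the suffix, which is exactly what a fixed dot count cannot express.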
Admin
Just reverse the string and compare the first (n) chars. No splits or regexes needed.
PS: Regex? Barf!
Admin
How to rile up a group of nerds: say regex is inappropriate for some string handling task
Admin
oof...
For a real nightmare, though, we should pay attention to the .us ccTLD, as described in https://en.wikipedia.org/wiki/.us because it has suffixes of two or three words, except where it doesn't. Like when the suffix order is reversed:
stuff.state.XX.us
where XX is the standard USPS abbreviation of the state's name, and in some cases, you get things like stuff.gov.state.XX.us for a four-word suffix.
Admin
.NET check .COM check .DE check .CO.UK ..... Oh S*t
Admin
And now anyone can register a domain directly under .uk too.
As described in the article, it's a badly thought out requirement unless it's only being used for a limited set of domains - and in that case, it would be better to use a config file to define what domains are being treated as "the same".
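The config-file idea could be sketched like this in Python. Everything here is hypothetical: the mapping, the domain names, and the function names are placeholders, and in practice the mapping would be loaded from a file rather than written inline.

```python
# Hypothetical configuration: canonical name -> domains treated as the same.
SAME_AS = {
    "example.com": ["example.co.uk", "mail.example.com"],
}


def build_lookup(config):
    """Flatten the config into a domain -> canonical-name dictionary."""
    lookup = {}
    for canonical, aliases in config.items():
        lookup[canonical.lower()] = canonical.lower()
        for alias in aliases:
            lookup[alias.lower()] = canonical.lower()
    return lookup


def same_domain(a, b, lookup):
    """Domains are 'the same' only if configured so, or if literally equal."""
    a, b = a.lower(), b.lower()
    return lookup.get(a, a) == lookup.get(b, b)


lookup = build_lookup(SAME_AS)
print(same_domain("Example.co.uk", "mail.example.com", lookup))
```

Anything not listed falls back to an exact comparison, so no guess is made about domains the configuration does not mention.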
Admin
Agree with the approach, but your code misses a requirement. how about this?
Admin
Not sure I follow. The comment says
//abc.def.ghi.com. should return abc.def.ghi.com
Which is consistent with howManyDots = 3 returning a string with three dots.
Admin
Fred, Tinkle et al have got it right. Ignoring the problem domain and treating this only as a string-handling problem means you're only dealing with one half of the WTF (a significant half, I'll admit).
Are transport.nsw.gov.au and education.nsw.gov.au the same domain or different? (Correct answer: different)
Looking at these four University domains, are they the same or different? sydney.edu.au unsw.edu.au usyd.edu.au wsu.edu.au (Correct answer: two are the same, the other two are different.)
If the overall goal is to accurately determine whether two things are part of the same domain or not, then the correct solution is probably a single line of string handling, followed by a metric WTFload of domain record handling.
Admin
That looks more like a case of them being four different domains belonging to three different human organisations. If the problem is to detect different domains, they are different. If the problem is to detect them being different human organisations, you're basically up to your neck in slime and lasers(1), and you don't have a Travis available to pull you out.
(1) https://www.imdb.com/title/tt0500274/quotes/?ref_=tt_trv_qu in the conversation between Major Thania and Trooper Par.
Admin
Some programmers obsessively implement exactly what a standard says; others implement what they think it says or what they want it to say. Products from the first kind of programmer interwork well; those from the second group suffer lots of corner-case bugs.
And in the case of email, we have a very good example of failures due to the second kind of programmer that many people may have seen or personally experienced: when Google implemented the ability to use multiple addresses in Gmail using '+', some attempted uses wouldn't work because some bad programmer had attempted to validate an email address but disallowed '+' in it.
And to look at the big picture, I cannot see any legitimate reason to attempt to do what this code seems to be doing. It's a bad idea and impossible anyway. Not only are there cases like .co.uk and .ac.uk which are TLDs, but you can buy .uk domains (not all, obviously!), so for example, mail.uk and mail.co.uk seem to be some kind of mailbox hosting service. It's almost certainly one of those XY situations where you need to go back to whoever asked for this and find out what they really want. And possibly just say no to them.
Admin
It is a shame that even on TDWTF so many hopped on to improve an algorithm that fails at a problem you shouldn't solve yourself. To split the user and host parts, create an email object and retrieve the host part. Email addresses allow many escape and quote sequences. If you are not sure the address itself is valid, the last @ might not be the right place to split. And as many pointed out, finding a common second-level domain behind a TLD is even trickier. Pretty close to date handling ;-)
Admin
Just forget all about the second parameter. I guess it's Java but could also compile in C# if I'm not mistaken.
//using System.Linq;
string GetDomain(string emailAddress, int _)
{
    string fullDomain = emailAddress.Substring(emailAddress.IndexOf('@') + 1);
    var parts = fullDomain.Split('.');
    if (parts[parts.Length - 2] == "co")
        return string.Join(".", parts.Skip(parts.Length - 3));
    return string.Join(".", parts.Skip(parts.Length - 2));
}
Admin
Abc.def.ghi.com is consistent with the comment under howmanydots==3, but it fails to be consistent with the comment under howmanydots==2, which says it should return def.ghi.com for the same input string
Admin
Just do this: throw new StupidRequirementException("WTF are you on");
There is NOTHING safely useful that this function can achieve, no matter how it's written. The RWTF is the idea that it could be useful.
Take three email addresses: [email protected] [email protected] [email protected] Same company? You don't know. Same mailbox? You don't know. Same mailserver? You don't know.
If you assume you know anything about what happens to an email after it leaves your site you are wrong.