• Vault_Dweller (unregistered)

    At least they used clear variable names.

  • Alexander Malfait (unregistered)

    (Please, don't use a regex for tasks like this)

    Using a regex for tasks like this will almost always be clearer and more robust than doing multiple splits, counting array lengths, rejoining with correct syntax.

    Unless your developers never bothered to learn basic regex.

  • Bart (unregistered) in reply to Alexander Malfait

    Well, humour me and give us the correct RegEx for this task. (I bothered to learn basic regex, but I suck at it.)

  • John Melville (unregistered) in reply to Alexander Malfait

    Cannot agree more. It has been nearly 15 years for me since a book basically told me to buck up and learn regular expressions, because that is how pros manage string data. I accepted a "friendly challenge" to use nothing but regexes for string manipulation over the course of 30 days. I have never looked back. Regexes are only confusing if you do not understand them. Yes, regexes are code, and they need good unit tests like any code to make sure they are correct. But a single regex can replace tens or hundreds of lines of splits and loops. They are now my go-to tool for string manipulation and searching.

  • Flipull (unregistered)

    I love that external parameter called "howmanydots". I wonder why the programmer didn't implement that as a variable inside the function.

  • (nodebb) in reply to Alexander Malfait

    When I have interviewed candidates for Java programmer roles, competence with regex was always a requirement. I recommend https://regex101.com/quiz to Remy. :-)

  • dpm (unregistered)

    I always use try/catch blocks when performing incredibly complex tasks like [checks notes] extracting a sub-string.

  • guest (unregistered)

    Let's see any of the proposed solutions find the correct domain for news.bbc.co.uk. Good luck.

  • peanutlord (unregistered)

    Wouldn’t a substring check also solve this, while being clearer than split or regexp?

  • Charles (unregistered)

    While regexes should be used where appropriate, in this case they may not be, unless you are using an enhanced kind. There are two operations required: splitting the address into its local and domain parts, and splitting the domain part into its labels. The first task is merely finding the last '@' character, which is perfectly suited to a regex, though there may be a simpler alternative (e.g. Python has rfind and rindex). The second needs to return a variable number of items, which many regex implementations cannot do, so it may be better to use a function that splits a string at a delimiter, if your language has such a thing.

    And don't forget that you shouldn't generally implement this sort of thing yourself; use a library function. Otherwise you'll probably get it wrong. In the case of email addresses, many people would look for the first '@', which is wrong, as that is a valid character in the local part of an email address. Test your implementation on this:

    "!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"@example.

    Yes, that has all possible characters in the local part, and the domain ends in a dot - both perfectly valid.
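
    The two steps described above could be sketched roughly as follows. This is a minimal Python sketch, not a validated parser: the name `split_address` is made up for illustration, and it assumes the address has already been checked by a proper library, since splitting at the last '@' is only safe for well-formed input.

```python
def split_address(address):
    """Split an address into (local part, list of domain labels)."""
    # rpartition splits at the LAST '@', so an '@' inside a quoted
    # local part stays with the local part.
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("no '@' in address")
    # A single trailing dot marks a fully qualified name; drop it
    # so it does not produce an empty final label.
    if domain.endswith("."):
        domain = domain[:-1]
    return local, domain.split(".")


print(split_address('"a@b"@mail.example.'))
```

    Plain string methods are enough here because both steps key off a single delimiter character.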

  • Richard Brantley (unregistered) in reply to Alexander Malfait

    “Using a regex for tasks like this will almost always be clearer….” Honestly, I disagree. Regular expressions were not designed for legibility and expressiveness.

  • Tinkle (unregistered) in reply to Rick

    Email addresses are not basic regex.

    Regexes tend to be write-only unless you use them regularly (pun not intended), so I try to avoid them unless it is a simple task.

    Here is my (badly broken) attempt at getting the domain from an email address.

    .@(..)?(?<domain>.+..+)((.*)$)?

    1. It will fail if there is more than one @
    2. It will only get what is right of the last .
    3. It fails to trim off a comment from the end of the email address.
  • (author)

    It seems I should expand: regexes are complicated and perform like hot garbage. When your task is manipulating a string based on a set of single character markers, a regex is like using a semi truck to haul a lemon to your door. It certainly works, and if you already own the semi truck, then it doesn't seem that much of a problem, but it doesn't exactly feel like the right tool for the job.

    I'm not anti-regex (and I'm quite the expert at using regexes in the find/replace box of my editor to fix my terrible code), but regexes should always be the last tool you reach for in code.

    (I just wrote some tokenizer code using regexes last night because it was late and I was lazy, but regexes were absolutely the wrong tool)

  • grog_brain_dev (unregistered) in reply to Alexander Malfait

    me no use regex but then me no big brain dev. me grog-brain dev. still me think this better than any regex

    #!/bin/env ruby
    domain = "[email protected]".split(".")[-2..-1]

  • Fred (unregistered)

    The real WTF is the misunderstanding of DNS, in particular the fact that effective TLDs are not necessarily the very last label of the domain name.

    Is it really the intention of the code that microsoft.co.uk and apple.co.uk should be the same ".co.uk" domain?

  • Tinkle (unregistered)

    Call me slow today, but isn't the real WTF wanting foo.bar.com and bar.com to be recognised as the same domain? They are different and can have almost no relation to each other.

  • WellActually (unregistered)

    Effective TLDs that contain a dot, like .co.uk, mean that the problem is not really solvable by simple text manipulation alone. Luckily, adding a check against the Public Suffix List shouldn't be too hard.

  • Pag (unregistered) in reply to Fred

    No, look again. This is what howmanydots is for.

  • (nodebb) in reply to Fred

    They didn't know about the Public Suffix List.

  • Fire Mountain (unregistered) in reply to Rick

    That quiz site actually shows exactly what CAN be wrong about regex. I use regex all the time in Linux with sed/awk. And that site tells me I'm wrong on my regex submissions to the quiz because I don't use EXACTLY the syntax or EXACTLY the solution they want you to use. Syntax being a large part of it.

    That's why regex can be hard. There are about a dozen or more implementations to deal with in the world. What needs to be escaped? What doesn't? Is the replacement grouping using \1 or $1? etc...

    There is no ONE TRUE REGEX(tm)...

  • (author) in reply to Fire Mountain

    Oh man, the \1 vs $1 drives me up the wall. So many times I hit enter and change my code to loads of \1 strings everywhere.

  • (nodebb)

    IndexOf vs. Regex seems like a red herring. Isn't there a library to do this?

  • COB666 (unregistered)

    This logic kind of falls apart when you have domains like company.co.uk

  • Gearhead (unregistered)

    I love regexes, but the howManyDots requirement significantly complicates the regex. You'd end up with something like

    $emailAddress =~ m/@(([^\.]+\.){0,$howManyDots}$[^\.]+)/;

    And that doesn't even work for howManyDots = 0.

  • Gearhead (unregistered)

    Ack. Messed up the dollar sign placement.

    $emailAddress =~ m/@(([^\.]+\.){0,$howManyDots}[^\.]+)$/;
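
    For comparison, a hedged Python sketch of the same idea. It is not a translation of the Perl above: it drops the leading '@' anchor so the match can start partway into the domain, which is an assumption about the intended behaviour (keep at most howManyDots dots, counted from the right). The name `domain_tail` is hypothetical.

```python
import re


def domain_tail(address, how_many_dots):
    """Return the last how_many_dots + 1 labels of the domain, or None."""
    # Up to how_many_dots "label." groups followed by one final label,
    # anchored at the end. '@' is excluded from the labels so the match
    # can never reach back into the local part.
    pattern = r"(?:[^.@]+\.){0,%d}[^.@]+$" % how_many_dots
    match = re.search(pattern, address)
    return match.group(0) if match else None


print(domain_tail("user@abc.def.ghi.com", 1))
```

    With this shape a value of 0 is also accepted, and a domain with fewer labels than requested is returned whole. A domain with a trailing dot does not match and yields None.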

  • Nido (unregistered)

    Nobody noticed how the example abc.def.ghi.com needs to be dissected two different ways according to the comment?

  • (nodebb) in reply to Fred

    It's worse than that, actually, because there are a few legacy .uk domains that go directly onto .uk without an intervening "flavour" modifier (co, edu, gov, etc.). The primary examples are "parliament.uk" and "nhs.uk".

    And France has a similar issue, where the "effective TLD" also isn't a fixed number of words. Companies go directly on .fr (leroymerlin.fr), while government departments go on .gouv.fr (impots.gouv.fr)

  • (nodebb)

    This looks like C# to me. Here's a one-liner:

    return string.Join(".", new Regex(@"^.*@(?<Part>[^.]+)(\.(?<Part>[^.]+))*$").Match(email).Groups["Part"].Captures.Cast<Capture>().Reverse().Take(howManyDots + 1).Reverse().Select(c=>c.Value));

    This is not tested, but should be correct or very close to being correct.

  • (nodebb)

    Or a little better:

    static readonly Regex _rgxEmailDomainParts = new Regex(@"^.*@(?<Part>[^.]+)(\.(?<Part>[^.]+))*$");

    var group = _rgxEmailDomainParts.Match(email).Groups["Part"];

    var list = new List<string>();

    for(int i = 0; i < howManyDots; ++i) if(i < group.Captures.Count) list.Add(group.Captures[group.Captures.Count - howManyDots + i].Value); // I think I got the addition/subtraction right? don't quote me on this :)

    return string.Join(".", list);

  • (nodebb) in reply to Charles

    This may be a perfectly legal email address, but come on, man.

  • (nodebb) in reply to Steve_The_Cynic

    And now here in Australia, we have opened up '.au', so you can have 'company.au' or 'company.com.au'.

  • (nodebb)

    Most efficient version, without regex:

    int lastIndex = email.Length;
    for(int i = 0; i < howManyDots; ++i)
    {
      int index = email.LastIndexOf(".", lastIndex);
      if(index == -1)
      {
        lastIndex = email.LastIndexOf("@") - 1;
        break;
      }
      lastIndex = index - 1;
    }
    return email.Substring(lastIndex + 2);
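
    A rough Python sketch of the same index-walking idea, for anyone who wants to try it out. It assumes the result should contain howManyDots dots (so howManyDots + 1 labels), and that the whole domain is returned when it has fewer dots than that. The name `last_labels` is made up for this sketch.

```python
def last_labels(address, how_many_dots):
    """Return the part of the domain that contains how_many_dots dots."""
    at = address.rfind("@")
    end = len(address)
    # Walk backwards over how_many_dots + 1 dots, never searching
    # to the left of the '@'.
    for _ in range(how_many_dots + 1):
        dot = address.rfind(".", at + 1, end)
        if dot == -1:
            # Ran out of dots: the whole domain is the answer.
            return address[at + 1:]
        end = dot
    # 'end' now sits on the dot just before the wanted labels.
    return address[end + 1:]


print(last_labels("user@abc.def.ghi.com", 2))
```

    No splitting, joining, or regex is involved, only repeated right-to-left searches over a shrinking window.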
    
  • DigitalBits (unregistered)

    Regexes (and for that matter, string splitting) won't work. As others have mentioned, some top level domains have subdomains that need to be taken into account.

    For example: [email protected] and [email protected] are both valid, but you certainly don't want to confuse bobindustries.govt.nz with aliceindustries.govt.nz.

    The only way to solve this is to look up a list of valid public top-level domains. Any other solution is going to be wrong for a relatively large number of users.
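
    To illustrate the lookup approach, here is a minimal Python sketch. The suffix set is a tiny hand-picked sample used purely as an assumption for the example; the real Public Suffix List is far larger and also has wildcard and exception rules that this sketch ignores. The names `SAMPLE_SUFFIXES` and `registrable_domain` are hypothetical.

```python
# Hypothetical sample only; a real implementation would load the
# published Public Suffix List instead of hard-coding entries.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "fr", "gouv.fr", "nz", "govt.nz"}


def registrable_domain(domain):
    """Return the public suffix plus one label, or None if unknown."""
    labels = domain.lower().rstrip(".").split(".")
    # A name that is itself a public suffix has no registrable part.
    if ".".join(labels) in SAMPLE_SUFFIXES:
        return None
    # Try the longest candidate suffix first, always leaving at
    # least one label in front of it.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SAMPLE_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None


print(registrable_domain("news.bbc.co.uk"))
```

    Because the longest suffix wins, names under .co.uk and names directly under .uk are both handled without a fixed label count.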

  • FTB (unregistered)

    Just reverse the string and compare the first (n) chars. No splits or regexes needed.

    PS: Regex? Barf!

  • Jaloopa (unregistered)

    How to rile up a group of nerds: say regex is inappropriate for some string handling task

  • (nodebb) in reply to ray73864

    oof...

    For a real nightmare, though, we should pay attention to the .us ccTLD, as described in https://en.wikipedia.org/wiki/.us because it has suffixes of two or three words, except where it doesn't. Like when the suffix order is reversed: stuff.state.XX.us where XX is the standard USPS abbreviation of the state's name, and in some cases, you get things like stuff.gov.state.XX.us for a four-word suffix.

  • Ruud (unregistered)

    .NET check .COM check .DE check .CO.UK ..... Oh S*t

  • A Nonny Moose (unregistered) in reply to Steve_The_Cynic

    And now anyone can register a domain directly under .uk too.

    As described in the article, it's a badly thought out requirement unless it's only being used for a limited set of domains - and in that case, it would be better to use a config file to define what domains are being treated as "the same".

  • Gearhead (unregistered) in reply to Mr. TA

    Agree with the approach, but your code misses a requirement. How about this?

    int lastIndex = email.Length;
    if (howManyDots > 3) {
      return email.Substring(email.IndexOf('@') + 1);
    }
    for(int i = 0; i < howManyDots; ++i)
    {
      int index = email.LastIndexOf(".", lastIndex);
      if(index == -1)
      {
        lastIndex = email.LastIndexOf("@") - 1;
        break;
      }
      lastIndex = index - 1;
    }
    return email.Substring(lastIndex + 2);
    
  • Gearhead (unregistered) in reply to Nido

    Nobody noticed how the example abc.def.ghi.com needs to be dissected two different ways according to the comment?

    Not sure I follow. The comment says

    //abc.def.ghi.com. should return abc.def.ghi.com

    Which is consistent with howManyDots = 3 returning a string with three dots.

  • (nodebb) in reply to Tinkle

    Fred, Tinkle et al have got it right. Ignoring the problem domain and treating this only as a string-handling problem means you're only dealing with one half of the WTF (a significant half, I'll admit).

    Are transport.nsw.gov.au and education.nsw.gov.au the same domain or different? (Correct answer: different)

    Looking at these four university domains, are they the same or different? sydney.edu.au, unsw.edu.au, usyd.edu.au, wsu.edu.au (Correct answer: two are the same, the other two are different.)

    If the overall goal is to accurately determine whether two things are part of the same domain or not, then the correct solution is probably a single line of string handling, followed by a metric WTFload of domain record handling.

  • (nodebb) in reply to Paddles

    Looking at these four university domains, are they the same or different? sydney.edu.au, unsw.edu.au, usyd.edu.au, wsu.edu.au (Correct answer: two are the same, the other two are different.)

    That looks more like a case of them being four different domains belonging to three different human organisations. If the problem is to detect different domains, they are different. If the problem is to detect them being different human organisations, you're basically up to your neck in slime and lasers(1), and you don't have a Travis available to pull you out.

    (1) https://www.imdb.com/title/tt0500274/quotes/?ref_=tt_trv_qu in the conversation between Major Thania and Trooper Par.

  • Charles (unregistered) in reply to konnichimade

    Some programmers obsessively implement exactly what a standard says; others implement what they think it says or what they want it to say. Products from the first kind of programmer interwork well; those from the second group suffer lots of corner-case bugs.

    And in the case of email, we have a very good example of failures due to the second kind of programmer that many people may have seen or personally experienced: when Google implemented the ability to use multiple addresses in Gmail using '+', some attempted uses wouldn't work because some bad programmer had attempted to validate an email address but disallowed '+' in it.

    And to look at the big picture, I cannot see any legitimate reason to attempt to do what this code seems to be doing. It's a bad idea and impossible anyway. Not only are there cases like .co.uk and .ac.uk which are TLDs, but you can buy .uk domains (not all, obviously!), so for example, mail.uk and mail.co.uk seem to be some kind of mailbox hosting service. It's almost certainly one of those XY situations where you need to go back to whoever asked for this and find out what they really want. And possibly just say no to them.

  • Carsten (unregistered) in reply to Charles

    It is a shame that even on TDWTF so many jumped in to improve an algorithm for a problem you shouldn't solve yourself. For splitting the user and host parts: create an email object and retrieve the host part. Email addresses allow many escape and quote sequences. If you are not sure that the address itself is valid, the last @ might not be the right place to split. And as many pointed out, finding a common second-level domain behind a TLD is even trickier. Pretty close to date handling ;-)

  • Just another coder (unregistered)

    Just forget all about the second parameter. I guess it's Java but could also compile in C# if I'm not mistaken.

    //using System.Linq;

    string GetDomain(string emailAddress, int _)
    {
      string fullDomain = emailAddress.Substring(emailAddress.IndexOf('@') + 1);
      var parts = fullDomain.Split('.');
      if (parts[parts.Length - 2] == "co")
        return string.Join(".", parts.Skip(parts.Length - 3));
      return string.Join(".", parts.Skip(parts.Length - 2));
    }

  • Nido (unregistered) in reply to Gearhead

    Abc.def.ghi.com is consistent with the comment under howmanydots==3, but it fails to be consistent with the comment under howmanydots==2, which says it should return def.ghi.com for the same input string

  • (nodebb)

    Just do this: throw new StupidRequirementException("WTF are you on");

    There is NOTHING safely useful that this function can achieve, no matter how it's written. The RWTF is the idea that it could be useful.

    Take three email addresses: [email protected] [email protected] [email protected] Same company? You don't know. Same mailbox? You don't know. Same mailserver? You don't know.

    If you assume you know anything about what happens to an email after it leaves your site you are wrong.
