Admin
Since cleaning data was mentioned... Back in 92-94 I was responsible for taking the printed (OK, SGML) products of a major tech publisher (one of the big two, and not ZD) and producing an electronic subscription product. The technology was pretty straightforward (though there is a tale or two to be told there), but the enormous part was determining what words meant, to see if they were talking about the same thing...
Admin
The code is a WTF, for sure. But the underlying issue is the existence of multiple systems where the data can reside. This can be names or any other object and field. It happens especially often in financial firms. They just love buying software. They buy a trading system, and an accounting system, and who knows what else. These third-party packages never do exactly what the client needs, usually not even close, so the vendor sells the client N hours of consulting to "configure" the software. Then the fun starts: an army of staff and contractors moves and massages data between all the systems. The result is always a mess. The cost is triple: 1) buy the software license plus recurring renewal, 2) customize it to work for them, 3) continuously ETL data around. Also, eventually people request reports, so a data warehouse is built with more data integration and coding and maintenance.
If you so much as suggest that an in-house solution be built, you're laughed out of the room. "Do you really think you can develop a ______ system yourself? Hah!"
🤦♂️
Admin
That commented-out loop makes me think that the original request (or the programmer's original interpretation of the request) was to only get the first name. The code was probably something like:
It explains why the current version is as is: it closely matches the logic of the original code.
Admin
Not sure how to add new lines there, but you should be able to make out the gist of it.
Admin
No. No, that is not the same thing. That makes my head hurt reading it.
Admin
Working in machine learning, this hit way too close to home... Everyone gets all excited about the latest transformers derivative with a "clever" name, but the real value will forever be data science. So, so, so many people get into this job thinking they're going to develop clever models, but sorry, your job is cleaning and transforming data, and then MAYBE some model development, IF the data is good enough. (University ML courses really need to make this point a bit clearer.)
This is part of the reason Google et al. dominate the industry. They have the data, and the resources to clean and sort the data. Nobody else can compete.
Admin
"so what's the longest surname you can find that's a substring of a reasonably common first name"
It doesn't even have to be all that complicated. I have a friend whose first and last names are the same. It came about as a result of the parents naming their kid after the mother's maiden name, the parents splitting, and the child being raised by the mother. When old enough, the child decided they didn't want their father's last name, so they had it legally changed to their mother's maiden name. The result is the first and last name being the same.
I'm sure it's not the only case...
Admin
https://en.wikipedia.org/wiki/Boutros_Boutros-Ghali offers a different but related problem.
Admin
“Three Men in a Boat” was written by…
Jerome K. Jerome.
Admin
Fun related wiki article: https://en.wikipedia.org/wiki/List_of_people_with_reduplicated_names
which includes this completely outrageous name: Leone Sextus Denys Oswolf Fraudatifilius Tollemache-Tollemache de Orellana Plantagenet Tollemache-Tollemache
Good luck parsing that.
Admin
https://en.wikipedia.org/wiki/Lang_Lang
Admin
Funny timing. I just had a go at a national fast food chain whose website wouldn't allow my 2-word last name because it has a space in it (or any punctuation). Not that I was upset about it, but perplexed that it's an issue in 2021 (and not in some mom-and-pop store, with a website built by their nephew who is "good with computers").
We can't assume a last name is only one word. Can we assume a first name is one word (even if hyphenated)? Not counting middle names. I can't think of an example off the top of my head, but I doubt it. Even if you could, you just know someone somewhere has entered a hyphenated or concatenated name as two separate words, even if it shouldn't be.
So how do you parse a string containing the whole name, which has more than 2 words, to determine what the first name is? Obviously not this. I guess in the absence of more information, the best you could do is return the first word.
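A minimal sketch of that "return the first word" fallback, assuming the input is a single whitespace-separated string. The function name `guess_first_name` is made up for illustration, and the result is a guess rather than a real parse:

```python
def guess_first_name(full_name: str) -> str:
    """Return the first whitespace-separated word of full_name.

    This is only a fallback guess: it is wrong for anyone whose
    first name is itself more than one word. Returns an empty
    string when the input is empty or only whitespace.
    """
    parts = full_name.split()
    return parts[0] if parts else ""
```

With no other information available, this at least never throws on empty input and never returns part of the surname by accident.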
Admin
If the first name is the same as the surname, then the method actually works: the result is still the same (the buggy algorithm just happens to produce the correct answer, which is also kinda sad).
Admin
Try living in a country that insists on everybody having two surnames when you only have one.
e.g. Spain.
Admin
We had one old system break because it assumed surnames would never be more than 30 characters and one Iberian guy had something like "De Valadares y Seguenda Villaraya Garcia Marquez"
Admin
Nope. There's a guy in my chess club who goes by two first names, unhyphenated. And not just the first one on its own, either.
Admin
Cheating slightly because he is fictional: Catch-22 has a character who was named Major Major Major and who was accidentally promoted to major, so he is Major Major Major Major.
Admin
Certainly some people are named "Mary Ann" as a first name in addition to a middle name.
I think the correct answer is that names cannot be parsed or manipulated for display. Many modern tools ask for multiple versions of a person's name for different purposes. For instance, Slack asks for a "full name" and a "display name" (which they indicate could be a first name, or whatever you wish to be called).
Of course, entity resolution is an entirely different situation which could call for considerable manipulation... but the results probably shouldn't be shown to the user.
Addendum 2021-08-31 10:36: (meant that Mary Ann would also have a separate middle name)
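To make the "ask for multiple versions" idea concrete, here is a rough sketch of a record that stores what the person typed and never derives one field from the other. The class and field names (`PersonName`, `full_name`, `display_name`, `shown_as`) are hypothetical and are not Slack's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    """Hypothetical record: both fields are stored exactly as entered."""

    full_name: str          # whatever the person considers their full name
    display_name: str = ""  # optional; what they want to be called

    def shown_as(self) -> str:
        # Prefer the display name the person chose; otherwise show
        # the full name untouched instead of trying to split it.
        return self.display_name.strip() or self.full_name
```

The point of the design is that no code path ever has to decide where the first name ends.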
Admin
I brought my wife to the US from Brazil on a K-1 visa. She and our kids are Brazilian citizens (our children are also US citizens by virtue of my citizenship). Brazilians generally receive one surname from each parent (mother followed by father) and this may sometimes include prefixes ("Gomes dos Santos" for example). All Brazilian documentation for my wife and kids have a single given name and two surnames, while US documentation has given names and one surname. It makes international travel fun.
Admin
Let me just say... writing a good name parser is a bitch. A real bitch. Wish Google's API would come up with something to meet the need. Anyone have any suggestions out there? Thanks.
Admin
There'd be plenty in Scotland and Ireland, e.g. Alastair McAlastair, Connor O'Connor, Donald MacDonald...
Admin
The longest surname I can come up with that is a proper substring of a fairly common given name is "Hall", which can be found in names like "Challis", "Dashall", "Halle", "Khallil" and "Marshall". Can anyone match or beat four characters? Presumably, non-Roman alphabets are fair play.
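For anyone who wants to take up the challenge with real name lists, a brute-force search is enough. This is only a sketch: the function name and the idea of passing in two plain lists are assumptions, and it compares case-insensitively using a plain substring test:

```python
def longest_embedded_surnames(surnames, given_names):
    """Find the longest surnames that are proper substrings of a given name.

    Returns a sorted list of (length, surname, given_name) tuples for
    the longest matches only. Comparison is case-insensitive, and
    "proper" means the surname must be strictly shorter than the
    given name, so identical names do not count.
    """
    hits = []
    for surname in surnames:
        s = surname.lower()
        for given in given_names:
            g = given.lower()
            if len(s) < len(g) and s in g:
                hits.append((len(s), surname, given))
    if not hits:
        return []
    best = max(length for length, _, _ in hits)
    return sorted(hit for hit in hits if hit[0] == best)
```

It is quadratic in the size of the lists, which is fine for a one-off search over a few thousand names.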
Admin
I've met a guy who had NO surname. Apparently an effect of being born without a legal father in his culture.
So yes, parsing names is always impossible.
Admin
Slightly off topic but related: Password rules. I recently had trouble because my password didn't fulfill the "no spaces" rule. Only... It was my existing password and the rule was being checked on the login mask, not the register mask.
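A rough sketch of where such a rule belongs, assuming a simple salted-hash scheme: the "no spaces" check runs only when a password is set, while login just compares hashes, so existing passwords keep working. All function names here are hypothetical, and the `enforce_policy` flag only exists to simulate an account created before the rule was introduced:

```python
import hashlib
import hmac
import os


def _digest(password: str, salt: bytes) -> bytes:
    # Derive a salted hash with the standard library's PBKDF2.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def violates_policy(password: str) -> bool:
    # The "no spaces" rule from the comment above.
    return " " in password


def register(password: str, enforce_policy: bool = True):
    # Policy is checked here, when a password is created or changed.
    if enforce_policy and violates_policy(password):
        raise ValueError("password must not contain spaces")
    salt = os.urandom(16)
    return salt, _digest(password, salt)


def login(password: str, salt: bytes, stored: bytes) -> bool:
    # No policy check at login: only compare against the stored hash.
    return hmac.compare_digest(_digest(password, salt), stored)
```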
Admin
Somehow this made me think of "Dr. Chandrasekharamphili", the full name of Dr. Chandra from "2010". (That's in the novel.)