Comment On Unglobalization

Don't ya'll hate when ya get all them funny French-lookin' letters muckin' up the datas on yer Internets? Well now ya'll can thank Jean-Philippe Daigle for inventin' this handy dandy function here. Yeeeeeeeee Haw!

re: Unglobalization

2004-10-15 13:54 • by Sherrif Roscoe
Removing French letters? I think this should be named freedomizeText() instead!

re: Unglobalization

2004-10-15 14:15 • by skicow
WTF?! Why is he passing the StringBuilder byref and then returning it as well?!?

re: Unglobalization

2004-10-19 03:46 • by Tomalak
I can think of at least one good reason to do things like this: Imagine something like an intranet phone book in an international firm. Nobody wants to vgrep-cut-and-paste the German 'ö' from some dubious charset software just because the DARN thing is missing on the Swedish-layout keyboard and there is no other way to get the desired phone number.

That does not mean that the 'ö' is not going to display in the intranet page, but the application is much more usable when it displays some "could be" matches.

Despite the lengthy and maybe naive implementation I can't see a WTF here.

re: Unglobalization

2004-10-15 14:17 • by gunnar
He forgot 'Ø' and 'ø'.

re: Unglobalization

2004-10-15 14:31 • by Jacques Troux
He also forgot the French ligatures "Œ" and "œ".
Icelanders will appreciate their "ð" being changed to "o", which is a completely unrelated letter.

re: Unglobalization

2004-10-15 14:32 • by Eric
I think he also forgot to replace £,€,¥, etc. with $.

re: Unglobalization

2004-10-15 14:36 • by Mike R.
Surprised they didn't replace þ with anything..

re: Unglobalization

2004-10-15 14:36 • by Jean-Philippe Daigle
We'd see this everywhere if programmers were paid per line of code. ;)

re: Unglobalization

2004-10-15 15:05 • by DCD
StringBuilder is an object, so passing it ByRef is unnecessary unless he wants to change the reference itself.
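
A rough sketch of that distinction, written in Python as an analogy rather than in VB.NET (the function names are made up for illustration). Python passes object references much like a reference type passed ByVal: the callee can mutate the caller's object, but rebinding the parameter does not change what the caller's variable points to.

```python
# Python analogy only; this is not the VB.NET code under discussion.

def strip_in_place(chars):
    """Mutate the caller's list in place; no return value is needed."""
    for i, ch in enumerate(chars):
        if ch == "\u00e9":  # "é"
            chars[i] = "e"


def rebind(chars):
    """Rebinding the local name leaves the caller's variable untouched."""
    chars = list("replaced")


buf = list("caf\u00e9")  # "café" as a mutable list of characters

strip_in_place(buf)
after_mutation = "".join(buf)  # the caller sees the mutation

rebind(buf)
after_rebind = "".join(buf)    # the caller does not see the rebinding

print(after_mutation, after_rebind)
```

Both values come out as "cafe": the in-place change is visible to the caller, while the reassignment inside rebind is not. Changing the caller's reference is the one thing ByRef would add.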

re: Unglobalization

2004-10-15 15:14 • by Phil Scott
I'm willing to bet this was VB6 code "upgraded" into a .NET project. Since ByRef was the default in VB6, the upgrade wizards probably threw it in there.

And the developer probably at one time heard that strings were slow, so they decided to replace the String being passed in with a stringBuilder.

re: Unglobalization

2004-10-15 15:16 • by Mike R.
Incidentally, þ would be converted as "th". They'd probably replace it with something silly like "p" ... which would change its name from þorn to something... err, less... wholesome ;)

re: Unglobalization

2004-10-15 15:16 • by Adam W.
Pedantic correction: it's spelled "y'all", seeing as it's a contraction of "you all".

re: Unglobalization

2004-10-15 15:18 • by WanFactory
Unless my understanding of VB(.NET?) is severely flawed, I think both the byref and the return are completely unnecessary.

If someone knows differently please speak up, I'm just a Java guy making possibly unwarranted assumptions.

re: Unglobalization

2004-10-15 15:24 • by Phil Scott
Totally of subject, but HOLY CRAP look what's on the consulting page from the oracle dude from yesterday: http://www.dba-oracle.com/redneck.htm

re: Unglobalization

2004-10-15 15:25 • by Phil Scott
Errr, "totally off subject"...

re: Unglobalization

2004-10-15 15:38 • by Jim Bolla
While this is certainly amusing, and its implementation may be flawed, the need for it could actually be justified if for instance one were building an app that had to talk to a legacy system that would explode on all those crazy characters.

re: Unglobalization

2004-10-15 15:39 • by josh
I look forward to seeing the function that cleans up Chinese. :)

re: Unglobalization

2004-10-15 15:42 • by AvonWyss
It's certainly not a very nice piece of code, but on the other hand, I have wished to find some function in the framework to convert those special chars to ASCII chars. But my search has not been successful. They didn't even bother to do some meaningful conversion: when you call System.Text.Encoding.ASCII.GetBytes("öäü"), you only get the bytes for '?' in the resulting byte array. Now that's not useful, and I believe this kind of problem leads to such inventions.
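
The same behaviour is easy to reproduce outside .NET. A minimal Python sketch, offered only as an illustration of what a lossy ASCII encode does (it says nothing about how the .NET encoder is implemented):

```python
text = "\u00f6\u00e4\u00fc"  # "öäü"

# errors="replace" substitutes "?" for every character ASCII cannot represent.
replaced = text.encode("ascii", errors="replace")

# The default mode (errors="strict") raises instead of substituting.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, strict_failed)  # b'???' True
```

Either way the original letters are gone, which is why people end up writing their own replacement tables.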

re: Unglobalization

2004-10-15 16:02 • by mike roome
"Icelanders will appreciate their 'ð' being changed to 'o', which is a completely unrelated letter. "

Unless it's uppercase, of course, in which case it gets changed to D...

----

"on the other hand, I have wished to find some function in the framework to convert those special chars to ASCII chars."

That's impossible to do in a general way, though. You can't just replace a letter that has a diacritic with the corresponding letter without it, since that can change the meaning of words entirely. Any translation of non-ASCII characters to sequences of ASCII characters (for instance, German ö can be replaced with oe) is highly culture dependent. And that's without even contemplating non-Roman scripts, where the translation of individual codepoints is not only culture dependent, but also dependent on what romanisation method you use.

If you need to store things in a legacy system that only stores 8-bit encodings, the correct way to handle it is to use UTF-8. If you need the text to be readable ASCII, then you need to constrain the input to reject any non-ASCII characters, which will allow people to choose what ASCII representation they want, rather than automatically producing something that's horribly wrong.
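
Both options can be sketched in a few lines of Python. The helper names to_legacy_bytes and require_ascii are hypothetical, and the sketch assumes the legacy system stores bytes without interpreting them:

```python
def to_legacy_bytes(text):
    """Option 1: store the text as UTF-8 bytes in an 8-bit-clean field."""
    return text.encode("utf-8")


def require_ascii(text):
    """Option 2: reject non-ASCII input so the user picks the spelling."""
    if any(ord(ch) > 127 for ch in text):
        raise ValueError("please enter plain ASCII characters only")
    return text


stored = to_legacy_bytes("Zo\u00eb")   # "Zoë" becomes 4 bytes
restored = stored.decode("utf-8")      # round-trips without loss

try:
    require_ascii("Zo\u00eb")
    rejected = False
except ValueError:
    rejected = True

print(len(stored), restored == "Zo\u00eb", rejected)  # 4 True True
```

Neither path guesses at a replacement letter, which is the point being made above.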

re: Unglobalization

2004-10-15 16:26 • by Alex Papadimoulis
Phil, I believe you just provided the *perfect* link for "Yeeeee haw".

re: Unglobalization

2004-10-15 16:45 • by Tony Perrie
Also, this would have been way funnier if "them" was "'em" or as we from Appalachia say, "'em 'ere funny peacenik markings"

re: Unglobalization

2004-10-15 17:23 • by Ilya Haykinson
Actually if you search google you can see that ya'll is a perfectly accepted alternative to y'all, albeit less frequently used.

In my opinion, y'all should be used to mean "you-all" which is the plural of "you", while ya'll should mean "all y'all" which means "all you-all" which is the equivalent of "all of you[plural]".

Or something like that.

re: Unglobalization

2004-10-15 17:44 • by Phil Scott
Taking more pot shots:

sbBuilder? StringBuilderBuilder?

re: Unglobalization

2004-10-15 18:29 • by Jon
But they are two totally different words mind you. Y'all can be used when conversing with one to three people, or two people and up to two live animals.

"All y'all" is only used when talking to four or more people, six sheep, or any combination thereof.

Furthermore, pluralizing "all" so that you have "All's y'all" is used when discussing a large, yet abstract group of individuals (or livestock).

re: Unglobalization

2004-10-15 18:54 • by Michael Giagnocavo
"System.Text.Encoding.ASCII.GetBytes("öäü"), you only get the bytes for '?' "

-- Um, yes, what did you expect? What about when you pass in Han characters to ASCII.GetBytes? What do you want it to do? Passing in invalid data shouldn't do silly magic behind the scenes. Be happy it's a '?' and not an exception :).

re: Unglobalization

2004-10-15 20:34 • by Phil McCracken
OK, I made a similar function that was used for a curse word filter. It replaced certain characters with other characters that they resembled like "|" with "I" so people couldn't pick a username like "SH|THEAD".

re: Unglobalization

2004-10-15 21:55 • by Josh
This is way off topic too, but about the Oracle guy. Follow the redneck link given above, and look on the right. One of the banners is for guide-horses, like guide dogs, only your kids can ride them.

http://www.guidehorse.org/

If this site isn't a sham, and this guy gets the rates that he's asking, I need to become an Oracle DBA.

re: Unglobalization

2004-10-15 23:44 • by Paul
You evil bastard... you know some dumb Americans will rip off this code for their webpages. Good for a laugh, but seriously, I think Americans vilify the French enough without making a concerted effort to corrupt their language (and that of every other European nation using the Latin alphabet as well).

re: Unglobalization

2004-10-16 00:14 • by Kylector
Oh goodness, this isn't about nationalism, the French, or the Americans. It's about laughable code. Who cares if a stupid American rips it off for some "evil" use, I'm sure stupid French people rip things off for "evil" use, too. It seems as though the "rest of the world" likes to call Americans arrogant and stupid, but the "rest of the world" sounds just as arrogant and stupid by saying it. Americans are not all the same, just like Europeans are not all the same, and Asians are not all the same. To lump every person in a country into a stereotype is about as arrogant and ignorant as you can get.

re: Unglobalization

2004-10-16 02:56 • by anonymous
What makes this especially funny is that the author's name is French.

Still, copy-and-paste coding isn't always bad. The above isn't any better or worse than defining a lookup table and looping through it to do the search-and-replace. Same difference, really.

The problem is that the code does what is intended, if that really is a problem; the code just might have been written in a situation where 7-bit ASCII output really was required. Email addresses, for example.

I've seen plenty of good code and bad code. Truly bad code is a bunch of big-ball-of-mud files without any defined interfaces, can't be summed up in a single WTF post, and is all too common. At least the above mistake is sufficiently abstracted (in a function) that you can fix it in that one place without having to grok tens of KLOCs of other WTF.

re: Unglobalization

2004-10-16 04:57 • by foxyshadis
The funny thing is that PHP has a function that does exactly this, in one quick call:

strtr($string, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy')

And it's usually misused and abused there too. ("Hey, these characters look similar, they must be the same!") I'm willing to bet whoever originally wrote replaceDiacritics (which mostly aren't diacritics) used that exact example above to do it.

That is the winning link of the week, hands down.

re: Unglobalization

2004-10-16 07:17 • by Gary Wheeler
Nobody's commented on how slow this is going to be. You end up making umpteen loops through the string, one for each call of the Replace() function. If you're going to do something as stupid as this, you could at least do it efficiently (one loop through the string, a simple range check on the character, and then search through a lookup table).
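
A minimal Python sketch of the single-pass approach described above. The table is a small hypothetical subset, not the article's actual mapping, and the function name is made up:

```python
# Hypothetical subset of a replacement table (single characters only).
LOOKUP = {
    "\u00e8": "e",  # è
    "\u00e9": "e",  # é
    "\u00fb": "u",  # û
    "\u00f6": "o",  # ö
}


def replace_single_pass(text, table=LOOKUP):
    """Walk the string once: ASCII passes straight through, the rest is looked up."""
    out = []
    for ch in text:
        if ord(ch) < 128:                  # simple range check
            out.append(ch)
        else:
            out.append(table.get(ch, ch))  # unknown characters are kept as-is
    return "".join(out)


result = replace_single_pass("Cr\u00e8me br\u00fbl\u00e9e")
print(result)  # Creme brulee
```

The cost is one pass over the input plus a constant-time lookup per non-ASCII character, instead of one full scan of the string per Replace() call.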

re: Unglobalization

2004-10-16 09:26 • by God
1) "Removing French letters? I think this should be named freedomizeText() instead!"

2) "I think he also forgot to replace £,€,¥, etc. with $."

---

Evidently you are both Americans.

1) The war was unjustified, pointless and whatnot, Chiraq was right, Blair and Bush were wrong

2) Unless it also includes the maths to convert the numbers, that would be stupid, as £1 is not $1, it's more like $1.80, cos your currency is pretty weak at this moment in time.

re: Unglobalization

2004-10-16 09:45 • by Jesus
As usual, God needs to get a sense of humor.

Why don't you just go and smite yourself.

re: Unglobalization

2004-10-16 11:20 • by Moses
Oh I dunno, there was that law about not coveting your neighbour's ass. That was quite funny - it would be if you saw my neighbour - oy vey.

re: Unglobalization

2004-10-16 15:22 • by Damian Cugley
I have no idea why they wanted this replaceDiacritics function, but I can think of a plausible reason: When I create web pages about photos of friends, I want to include their name in the URL, but URLs are restricted to plain US-ASCII, so a hypothetical Zoë László would have her photos in zoe-laszlo.html.

Python, Perl and PHP have easy string-map functions (for single-byte character data, at any rate); I don’t think dot-Net does.
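
For the URL case, one common Python approach is to decompose each character and drop the combining marks. This is a sketch of that technique, not the method from the article, and ascii_slug is a hypothetical name:

```python
import unicodedata


def ascii_slug(name):
    """Build a file name such as 'zoe-laszlo.html' from a display name."""
    # NFKD splits "ë" into "e" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and join the words with hyphens.
    return "-".join(stripped.lower().split()) + ".html"


slug = ascii_slug("Zo\u00eb L\u00e1szl\u00f3")  # "Zoë László"
print(slug)  # zoe-laszlo.html
```

Letters that are not a base letter plus a mark, such as ø, ð, þ or æ, do not decompose and would pass through unchanged, so they still need an explicit table.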

re: Unglobalization

2004-10-16 15:51 • by phnk
Hope this function will help your 26-chars limited brains to understand the world a bit better :)

re: Unglobalization

2004-10-16 16:30 • by josh
Why phnk, whatever are you

re: Unglobalization

2004-10-17 07:28 • by phnk
Long story…

re: Unglobalization

2004-10-18 14:10 • by Baf
I've written a function similar to this myself. My reason: sorting. I needed a way to tell a rather naive system that Õhm comes between Oglethorpe and Oldman, and the easiest way to do this was to create sort keys without diacriticals (or punctuation, or capitalization). The sort keys were used solely internally, not displayed to the user.
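
A sort key of that kind might look like the following Python sketch. It is an assumption about the general idea, not Baf's actual code: the key drops diacritics, punctuation and capitalization, and is used only for ordering.

```python
import unicodedata


def sort_key(name):
    """Internal-only key: no diacritics, punctuation or capitalization."""
    decomposed = unicodedata.normalize("NFKD", name).lower()
    return "".join(
        ch for ch in decomposed
        if ch.isalnum() and not unicodedata.combining(ch)
    )


names = ["Oldman", "\u00d5hm", "Oglethorpe"]  # "Õhm" in the middle
ordered = sorted(names, key=sort_key)
print(ordered)  # ['Oglethorpe', 'Õhm', 'Oldman']
```

The displayed names keep their original spelling; only the comparison uses the stripped form.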

So if you're laughing at the supposed American insularity of this code, I can't join you until I know what it was used for. There are legitimate reasons to want to strip diacriticals from a string.

I can, however, laugh at it for taking about 40 lines of code to do it.

re: Unglobalization

2004-10-18 17:53 • by Raymond Lewallen
biggest wtf here is returning the object sent by reference.

re: Unglobalization

2004-10-19 02:20 • by non_Dev
@Raymond Lewallen
Why is that a wtf? I think I speak for all of us when I say we've learned the importance of doing things at least twice.

Re: Unglobalization

2008-12-29 20:58 • by Daniel (unregistered)
> biggest wtf here is returning the object sent by reference.

Why do peeps keep saying this? Why would you assume that passing parameters by reference automatically means that you don't want to be able to use chaining?

Or am I misunderstanding VB here? (Having managed to stay clear of it so far)

Re: re: Unglobalization

2009-09-14 08:47 • by gurra g (unregistered)
foxyshadis:
The funny thing is that PHP has a function that does exactly this, in one quick call:

strtr($string, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy')


Except it can't handle multi-character replacements such as æ->ae.

The code in the article is easier to maintain, less error-prone and can handle special cases such as æ, so you fail.
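
For comparison, Python's built-in translation table handles both the one-to-one case and multi-character replacements in a single call. A small sketch with a hypothetical subset of mappings (the "th" for þ follows the suggestion earlier in the thread):

```python
# Keys must be single characters; values may be strings of any length.
TABLE = str.maketrans({
    "\u00e9": "e",   # é
    "\u00f6": "o",   # ö
    "\u00e6": "ae",  # æ
    "\u00c6": "AE",  # Æ
    "\u00fe": "th",  # þ
})


def unglobalize(text):
    """Apply every replacement in one pass over the string."""
    return text.translate(TABLE)


result = unglobalize("\u00e6ther caf\u00e9 \u00feorn")
print(result)  # aether cafe thorn
```

Keeping the mapping as data makes it a one-line change to add or correct an entry.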