The Daily WTF: Curious Perversions in Information Technology

2016-04-19

What is particularly amusing is the way the if .. elseif statements are configured so as to read like a car starting up on a cold morning.

If elseif. (Dammit.) If elseif elseif. (Oh come on, old girl.) If elseif elseif elseif elseif elseif elseif elseif elseif elseif ... (ahh, that's better.)

2016-04-19

"so that “ä” and “å” were treated as the same character when searching"

But why? å and ä and a are different characters. Anyone searching for "kräm" (paste or cream in Swedish) wouldn't want results for "kram" (hug) or "kråm" (entrails). Please don't do this.

2016-04-19

Names, dear, names. In my typeahead I want "Müller" to generate a hit on "Müller", "Muller", and "Mueller". And vice versa all around.
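
One way to get this kind of typeahead matching is to index several lookup keys per name and call it a hit when two names share a key. The sketch below is only an illustration: the class `TypeaheadKeys` and its methods are hypothetical names, the German digraph table is hard-coded and will be wrong for Swedish or Turkish names, and non-ASCII literals are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from the article's code base.
final class TypeaheadKeys {

    // Decompose, then drop combining marks, e.g. u-umlaut becomes plain u.
    static String foldAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Assumption: German-style spelling (ae, oe, ue, ss) is wanted as an extra key.
    static String germanDigraphs(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                .toLowerCase(Locale.ROOT)
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue")
                .replace("\u00df", "ss");
    }

    // All keys under which a name is indexed: exact, accent-folded, digraph form.
    static Set<String> keysFor(String name) {
        Set<String> keys = new LinkedHashSet<>();
        keys.add(Normalizer.normalize(name, Normalizer.Form.NFC).toLowerCase(Locale.ROOT));
        keys.add(foldAccents(name));
        keys.add(foldAccents(germanDigraphs(name)));
        return keys;
    }

    // Two names match when their key sets overlap.
    static boolean matches(String a, String b) {
        return !Collections.disjoint(keysFor(a), keysFor(b));
    }

    public static void main(String[] args) {
        System.out.println(keysFor("M\u00fcller"));
        System.out.println(matches("Muller", "M\u00fcller"));
        System.out.println(matches("Mueller", "M\u00fcller"));
    }
}
```

Under this sketch "Muller" and "Mueller" each match "Müller" but not each other, so "vice versa all around" would need more than shared keys, at the cost of more false hits.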

2016-04-19

> But why? å and ä and a are different characters. Anyone searching for "kräm" (paste or cream in Swedish) wouldn't want results for "kram" (hug) or "kråm" (entrails). Please don't do this.

Depends. I don't have either å or ä (or ö) on my keyboard, but if I want to spend my vacation in Västerås, I expect to find it by searching for Vasteras.

2016-04-19

Just convert anything outside the decimal range 32 ... 127 into an _. Job done. Who wants all these fscking Umlaute and Chinese funny symbols? Today's computers still work best with 7-bit ASCII characters. Imagine all the memory and headaches saved by not dealing with UTF-8, which is a devil's invention.

PWolff · 2016-04-19

We should have avoided the invention of lower-case letters in the first place.

And stuck to the good old Phoenician alphabet of about 20 letters.

Btw, dear Mr Contractor, we want to localize this app to Greece and Russia.

2016-04-19

At least Perl 6 is doing strings right, Remy.

2016-04-19

Clunky but not really a WTF. Nesting the "else"s would have been a WTF.

2016-04-19

At least he is documenting the code!

2016-04-19

If ASCII was good enough for God and Jesus to speak in the King James Bible, it's good enough for everyone.

2016-04-19

Call me Mister Naïve, sitting here in my internet café, but can't we just limit our œuvres to make sure we coördinate our operations so they work in the English language only? Then such irritations vanish with nary a soupçon of trouble -- no awkward diæreses or other diacritics, just good ole ASCII.

BaconBits · 2016-04-19

> But why? å and ä and a are different characters. Anyone searching for "kräm" (paste or cream in Swedish) wouldn't want results for "kram" (hug) or "kråm" (entrails). Please don't do this.

And if you're looking for a Spanish man named "Víctor" and don't know it's not spelled "Victor"?

Data systems understand characters. They don't understand languages. Diacritical muting on character collation is very, very useful. Almost every database supports both accent sensitive and accent insensitive searches. Unicode has all kinds of collation charts (http://unicode.org/charts/collation/) because when people search for things they often want to search based on the shape of the letter, not the exact character.
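
In Java, for instance, the standard `java.text.Collator` exposes this as a comparison strength: at `PRIMARY` strength accent and case differences are ignored, while at `TERTIARY` they count. A minimal sketch, assuming an English-locale collator; the class and method names are made up for illustration, and locale-specific rules may instead treat some accented letters as distinct base letters (Swedish collation traditionally does so for å and ä).

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical wrapper around the JDK collator, for illustration only.
final class AccentInsensitiveCompare {

    // Builds an English collator with the requested strength.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        // Make sure precomposed and decomposed forms compare alike.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    // PRIMARY strength ignores accents (and case) for this locale.
    static boolean sameIgnoringAccents(String a, String b) {
        return collator(Collator.PRIMARY).compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String accented = "V\u00edctor"; // i with acute accent
        System.out.println(sameIgnoringAccents(accented, "Victor"));
        System.out.println(collator(Collator.TERTIARY).compare(accented, "Victor") == 0);
    }
}
```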

2016-04-19

That's ridiculous. I can perform a search in a database and perform standard comparisons and ordering regardless of whether it's an a, à, ä, å, or ă.

If your code or database isn't capable of doing this, then consider using something else that can.

BaconBits · 2016-04-19

> That's ridiculous. I can perform a search in a database and perform standard comparisons and ordering regardless of whether it's an a, à, ä, å, or ă.

Well, you're only able to do that because someone else has written code that knows how to do accent insensitive searches. This is writing code that itself will handle it.

Say you're writing code that needs to translate Unicode names into RFC 5322 compliant email addresses because you know you'll have to support systems that don't support internationalized email addresses for probably the next decade (RFC 6530 is still at PROPOSED status). That means stripping diacritics since RFC 5322 specifies only a very limited number of valid characters in the US ASCII code page for addresses.

2016-04-19

So if I want to hug the cream coming out of entrails I would kram kråm kräm?

2016-04-19

Simple solution...convert everything to EBCDIC. If it was good enough for IBM mainframes, it should work everywhere!!

2016-04-19

Is failing to convert Ĥ and Ħ better or worse than substituting them with 'D'?

mihi · 2016-04-19

You would be surprised to hear how many (sometimes US-centric, sometimes not) companies do not support anything except (uppercase) A-Z, 0-9 and space in their APIs (especially bad are airlines...).

Which means that if you want to book a flight and your name contains any non-ASCII chars, you theoretically have to ask the user to spell his name in ASCII again (especially for Cyrillic names there is no other way, as transliteration depends on the language used). But for giving an initial value, a "good" way (which should handle all the cases in the submitted code and even more) is to convert to NFKD form (Java's Normalizer class can do that for you), uppercase the result and then strip all the unsupported characters (which also strips all the combining diacritics that were separated from the base letters by the NFKD decomposition).
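
The normalize, uppercase, strip recipe described here can be sketched in a few lines of Java. This is an illustration under assumptions, not any airline's actual rule: the class name `AirlineName` is hypothetical, the allowed set (A-Z, 0-9, space) is taken from the description in this comment, and non-ASCII literals are written as Unicode escapes.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper producing an initial ASCII suggestion for a name.
final class AirlineName {

    static String toApiName(String name) {
        // 1. Compatibility decomposition splits base letters from combining marks.
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFKD);
        // 2. Uppercase; sharp s becomes "SS" at this step.
        String upper = decomposed.toUpperCase(Locale.ROOT);
        // 3. Drop everything outside the supported set, including the combining marks.
        return upper.replaceAll("[^A-Z0-9 ]", "");
    }

    public static void main(String[] args) {
        System.out.println(toApiName("M\u00fcller"));         // u with diaeresis
        System.out.println(toApiName("V\u00e4ster\u00e5s"));  // a with diaeresis, a with ring
        System.out.println(toApiName("Stra\u00dfe"));         // sharp s
    }
}
```

Letters with no Unicode decomposition, such as Ħ (U+0126) or Đ (U+0110), and whole scripts such as Cyrillic are simply dropped by the final step, so the result is only an initial suggestion that the user should confirm.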

In case you want something "better" (handling more cases), there is a "unidecode" library (available at least for Perl, Python and Java), which basically looks like the sample code posted, but using a bigger and more structured lookup table, and containing a lot more characters (including Chinese ones and currency symbols) - but in practice (at least when dealing with typical European Latin names), the normalize and strip approach works well enough.

Since I'm German I also know the exceptions for German names (ae oe ue ss), which can be hard-coded, but may not work for Turkish or Swedish people (so don't use them everywhere...)

2016-04-19

There's another issue with the little snippet used at the top (repeated here for your viewing pleasure):

if (index == 292 && index == 294) { resultCharacterIndexes.add(68);//H }

ASCII 68 is not "H" - it's "D". In the full code block, "D" is listed, and it uses 68. Of course, neither of those will ever appear in the output, as they both require the quantum version of the index variable that can hold two values at once, but still, another example of why magic numbers are bad.
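
For comparison, here is roughly what the quoted branch was presumably meant to do, with `||` instead of `&&` and a character literal instead of the magic number 68. The surrounding class, the constant names, and the method signature are assumptions; only `index` and `resultCharacterIndexes` come from the quoted snippet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical container for the corrected branch.
final class HFolding {
    static final int H_WITH_CIRCUMFLEX = 292; // U+0124
    static final int H_WITH_STROKE = 294;     // U+0126

    // Appends 'H' for either code point; returns true if the index was handled.
    static boolean foldH(int index, List<Integer> resultCharacterIndexes) {
        if (index == H_WITH_CIRCUMFLEX || index == H_WITH_STROKE) {
            resultCharacterIndexes.add((int) 'H'); // 72, whereas 68 is 'D'
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        foldH(292, out);
        foldH(294, out);
        System.out.println(out);
    }
}
```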

2016-04-19

And no comments about "source"? Last time I checked, there are still plenty of good variable names available for the taking -- I have a special today on the name "result"; just $5.00 and you can have "result" and any of the variations possible by introducing diacritical marks.

2016-04-19

My version of the King James Bible has ligatures and diphthongs, which are definitely not in the ASCII character set.

Also, pretty sure that third image wasn't a cookie, as it has two of them with cream between them.

2016-04-20

Ah, you work for the local council. For my next passport, they will want to see my birth certificate, since their software was 7-bit only, and now they don't know if my name is Müller or Mueller. Or probably M_ller.

Scarlet_Manuka · 2016-04-20

TRWTF is Remy's comment:

Why on earth would anyone think that? Obviously they wanted to convert Ĥ and Ħ into H but stuffed it up - probably copying from the code for Ď and Đ since that has the same logical error and (as pointed out by IDatecapitalh and Moosie) the output character is "D" rather than "H".

Oh, I see - Remy was misinterpreting 292 and 294 as hex rather than decimal and thought they meant ʒ and ʔ rather than Ĥ and Ħ. That also explains the "may ʒ and ʔ you" bit, which seemed bizarrely phrased when I read it. (Of course, bizarre phrasings are not exactly unknown on TDWTF.)
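
The decimal-versus-hex mix-up is easy to check with the JDK's own character database. A small sketch, where the class name `CodePointRadix` and its `describe` method are hypothetical; `Character.getName` is the standard lookup for a code point's Unicode name.

```java
// Hypothetical helper that prints a code point with its Unicode name.
final class CodePointRadix {

    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // Read as decimal, as in the submitted code:
        System.out.println(describe(292));
        System.out.println(describe(294));
        // Read as hexadecimal, as in the misreading:
        System.out.println(describe(0x292));
        System.out.println(describe(0x294));
    }
}
```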

The main trouble with this sort of approach, apart from being a pain in the arse to do, is that you're inevitably going to miss weird possibilities - as soon as someone shows up with a ƀ they're going to have to add another case to their statement. I do agree with Remy's main point: if you have to do this sort of thing, you should be trying really really hard to use a collation that someone else has already written, instead of creating your own from scratch.

2016-04-20

Somehow I feel that if you--as a Swede interested in your country's history--wanted to find information about 17th century Swedish nobles and entered 'Claes Baner' as the search term, you would expect to see Claës Banér among your search results.

And in the case the forum messes up diacritics, it's 'Claës' with edieresis and 'Banér' with eacute.

2016-04-20

That's just crazy talk - the language I plan on using can't even left pad a string.

2016-04-20

> Which means that if you want to book a flight and your name contains any non-ASCII chars, you theoretically have to ask the user to spell his name in ASCII again (especially for Cyrillic names there is no other way as transliteration depends on the language used).

Practice doesn't match theory. Some airlines' web sites told my wife to input her name exactly as it appears in her passport. Her name in her passport includes Ñ. Some airlines' web sites barfed when she obeyed instructions.

> In case you want something "better" (handling more cases), there is a "unidecode" library (available at least for Perl, Python and Java), which basically looks like the sample code posted, but using a bigger and more structured lookup table, and containing a lot more characters (including Chinese ones and currency symbols)

Fail. 神戸 can be either Kobe or Kanbe. 幸恵 can be either Yukie or Sachie. In such cases a passport includes an ASCII spelling as well as the original name, but there's no way a library will guess which.

2016-04-20

> Call me Mister Naïve, sitting here in my internet café, but can't we just limit our œuvres to make sure we coördinate our operations so they work in the English language only? Then such irritations vanish with nary a soupçon of trouble -- no awkward diæreses or other diacritics, just good ole ASCII.

That makes ¢.

No wait, that's American language not English language. Time to go £ some ¢ into ʃomeone.

2016-04-21

err... setlocale(whatever) and then something something plaintext(something). CBA to actually look it up, but somebody tasked with actually writing the code should do that automatically before even flexing their fingers.

2016-04-21

TRWTF is the syntax highlighting that makes the if-condition, its closing bracket, and the opening curly all look like comments.

2016-04-21

In America, we'd # some ¢ into them, even if we had to take them to the [C pound](http://thedailywtf.com/articles/5_years_C-pound_experience) to do it.

2016-04-22

As a Swede living in a small town next to Västerås, I fail to see why anyone would ever want to spend their vacation there, it's probably the ugliest city in Sweden.

But I see your point.

2016-04-25

I would say it's also the wrong solution to the problem. Why not do something like a Soundex search instead of some arbitrary conversion that might change the semantics?
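
For reference, a phonetic key like the one suggested here could look as follows. This is a sketch of American Soundex as commonly described (first letter plus three digits, H and W transparent, vowels separating repeated codes); the class name is hypothetical and the input is assumed to already be ASCII, so accented names would still need some transliteration step first.

```java
import java.util.Locale;

// Hypothetical, minimal American Soundex for ASCII input.
final class SoundexSketch {
    // One entry per letter A..Z: digit code, '0' for vowels and Y, '-' for H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String soundex(String asciiName) {
        StringBuilder out = new StringBuilder();
        char lastCode = 0;
        for (char raw : asciiName.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a plain letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);   // keep the first letter itself
                lastCode = code;
            } else if (code == '-') {
                // H and W are skipped and do not separate equal codes
            } else if (code == '0') {
                lastCode = '0';    // a vowel separates equal codes
            } else {
                if (code != lastCode) {
                    out.append(code);
                }
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0');       // pad short keys with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Muller"));
        System.out.println(soundex("Mueller"));
        System.out.println(soundex("Miller"));
    }
}
```

Both `Muller` and `Mueller` come out as `M460` here, but so does `Miller`, which shows the trade-off: a phonetic key widens the match further than accent folding does.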

2016-04-25

You, sir, are evil.

urkerab · 2016-04-28

> Remy was misinterpreting 292 and 294 as hex rather than decimal

He can't even use the excuse that he accidentally wrote them as ʒ and ʔ in the source rather than Ĥ and Ħ.

Addendum 2016-04-28 06:28: Edit: The markdown previewer I used doesn't do &#; escapes, sigh.

2016-05-04

This is what spelling correction is for. Converting ä to a becomes very annoying for the native speaker. What if you're looking for Muller but the search engine insists on giving you Müller? How would you tell it to stop doing that?

2016-05-04

So you want a search engine with spelling correction. Not one that mixes up different characters.

2016-05-04

And as a native speaker of a language with å, ä and ö, I can tell you that it is very annoying when search engines mix them up. Spelling correction is much more useful.

2016-05-04

Yeah! Or rather: "krama kråmkräm" which actually sounds disgusting.

2016-05-04

Yes, I would want it to suggest the correct spelling (or even return those hits - with an option to search for claës banér instead). I would not want it to confuse the letters. What if there are hundreds of banérs and one baner? How would I tell it to search for baner if it doesn't know the difference?

2016-05-17

Because we speak American here.

And It's Collated