• (nodebb)

    In regard to the "ill-named" stuff and based on what the escape method is doing, I guess that whoever wrote the code does not speak English as their first language.

  • Markus Haug (unregistered) in reply to Jonathan Lydall

    I'd assume German, based on the first function.

    For the second function, I think they wanted to create a single function which standardizes string representations. This could be useful in sorting, for example.

    Granted, there are probably better ways to achieve that particular goal, but this is rather tame for a WTF, I would say.

  • Robin (unregistered)

    The best bit about the second method is surely that, if aEscape is false and the input string is no longer than the max length, an empty string is returned. I doubt that's what's intended, although I admit it's hard to be sure.

  • Apu (unregistered)

    The first method is also not perfect: it should check the case of the word where the letter is contained. For example, if we have a fully capitalized word "ÖLWANNE" the method would convert it to "OeLWANNE", which is obviously wrong.
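
    A minimal sketch of a case-aware replacement, for illustration only. The article's actual method is not reproduced here; `UmlautSketch`, `Transliterate`, the six-entry map, and the "neighbouring letter is uppercase" heuristic are all assumptions:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class UmlautSketch
    {
        // Assumed mapping for the six umlaut letters (hypothetical, not the article's table).
        private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
        {
            { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" },
            { 'Ä', "Ae" }, { 'Ö', "Oe" }, { 'Ü', "Ue" },
        };

        public static string Transliterate(string input)
        {
            if (string.IsNullOrEmpty(input)) return input;

            var sb = new StringBuilder(input.Length + 4);
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                if (!Map.TryGetValue(c, out string replacement))
                {
                    sb.Append(c);
                    continue;
                }

                // Heuristic: an uppercase umlaut next to another uppercase letter
                // is taken to be part of an all-caps word.
                bool upperNeighbour =
                    (i + 1 < input.Length && char.IsUpper(input[i + 1])) ||
                    (i > 0 && char.IsUpper(input[i - 1]));

                sb.Append(char.IsUpper(c) && upperNeighbour
                    ? replacement.ToUpperInvariant()
                    : replacement);
            }
            return sb.ToString();
        }
    }
    ```

    With this heuristic, `UmlautSketch.Transliterate("ÖLWANNE")` gives "OELWANNE" while "Ölwanne" still becomes "Oelwanne". A single-letter word such as "Ä" has no neighbour to look at and falls back to "Ae".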

  • Anonymous') OR 1=1; DROP TABLE wtf; -- (unregistered)

    Another bug in Escape() is that it doesn't handle denormalized representations of those 6 characters. Accented characters like Ä can be encoded in two ways, either as a single code point, or as two code points: a base character plus a combining accent mark. Compare:

    Ä - U+00C4 <LATIN CAPITAL LETTER A WITH DIAERESIS>
    Ä - U+0041 <LATIN CAPITAL LETTER A>, U+0308 <COMBINING DIAERESIS>

    Escape() only catches the first case and not the second.
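
    One way to catch both encodings is to normalize to Form C before replacing. A minimal sketch, assuming a stand-in `Escape` that only handles Ä (the article's real method is not reproduced; `NormalizeSketch` and `EscapeNormalized` are hypothetical names):

    ```csharp
    using System;
    using System.Text;

    public static class NormalizeSketch
    {
        // Hypothetical stand-in for the article's Escape(); handles only U+00C4.
        // string.Replace(string, string) compares ordinally, so it does not
        // match the decomposed sequence U+0041 U+0308.
        private static string Escape(string s) => s.Replace("\u00C4", "Ae");

        public static string EscapeNormalized(string input)
        {
            if (input == null) return null;

            // Form C composes a base letter plus combining mark into a single
            // code point where a precomposed one exists.
            return Escape(input.Normalize(NormalizationForm.FormC));
        }
    }
    ```

    Both `EscapeNormalized("\u00C4rger")` and `EscapeNormalized("A\u0308rger")` then go through the same replacement.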

  • Bjørn (unregistered)

    Well… seeing the first method, it's clear that this handles German text, so I suppose the authors are German. What the second function does (or tries to do) is harmonize the string length, in German: „vereinheitlichen“. Feed that into a popular translator… https://dict.leo.org/englisch-deutsch/vereinheitlichen … and you get harmonize, standardize, unitize, unify, normalize. The author seems to have made a bad choice among four better ones – but in the end, this is 'only' a bad translation, hardly a WTF.

  • RLB (unregistered)

    In fact, if that second assignment had merely said result = result.Substring(0, aMaxLength); instead of result = aInputString.Substring(0, aMaxLength);, I think the only odd thing would've been the names. (Oh, and, of course, the fact that the class has only two methods.)

  • Industrial Automation Engineer (unregistered)

    aEscape should be: anEscape
    aInput should be: anInputString

    That's Hungarian Notation for you. Aside: originally HN was intended to specify the unit/dimension of the variable. I dig that. It's a great idea. It was only later that the MS-Office product group / Petzold completely misunderstood this and thought it was meant to specify the type of the variable. That's a WTF of epic proportions.

  • Tim (unregistered)

    There's a whole genre of code WTFs called "functions that purport to handle a range of different input conditions but have only ever been tested with one specific subset". It makes you wonder why on earth someone would add the aEscape parameter when presumably it's never been called with aEscape=false.

  • Foo AKA Fooo (unregistered) in reply to RLB

    Only if Escape is true (though apparently, that's always the case here). If it was false, result is still empty.

    Of course, many would never admit it, but ternary assignments provide an elegant way to make it obvious that every intermediate result and the final result are always assigned (and only assigned once, indicated here by using "const" which is otherwise not necessary), while also making the code shorter and easier to read (once you overcome your hatred of ternaries):

      const string result = aEscape ? Escape(aInputString) : aInputString;
      return result != null && result.Length > aMaxLength ? result.Substring(0, aMaxLength) : result;
    
  • AStoryFromTheBattleFront (unregistered)

    A number of years ago a client reported weird character representation in various places on their website. That led me down the path of investigating how characters were represented and stored in MySQL. Circa 2010 there was a certain meaning to utf8. By 2017 or so this had evolved into something else under the same name. I also realized that things like emoji were now in common use and could well turn up as input. So I settled on utf8mb4 as the choice for the database and converted their content over. Haven't seen a problem since. Did I do good? Or miss the mark?

  • ooOOooGa (unregistered)

    From just the name of the function, I would think that 'UnifyString' should be some sort of canonicalization function. Especially when paired with a function that replaces accented characters with combinations of standard characters.

  • Kordalia (unregistered) in reply to Foo AKA Fooo

    You can only put a constant into a const variable.
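
    For illustration, the same two lines with `const` dropped, wrapped so that they compile. This is a sketch under assumptions: the parameter order, the `UnifySketch` class, and the stand-in `Escape` are hypothetical and not taken from the article:

    ```csharp
    public static class UnifySketch
    {
        // Hypothetical stand-in; the article's real Escape() is not reproduced here.
        private static string Escape(string s) => s?.Replace("Ä", "Ae");

        public static string UnifyString(string aInputString, int aMaxLength, bool aEscape)
        {
            // 'const' is rejected here because the initializer is only known at
            // run time; a plain local assigned exactly once does the same job.
            string result = aEscape ? Escape(aInputString) : aInputString;

            return result != null && result.Length > aMaxLength
                ? result.Substring(0, aMaxLength)
                : result;
        }
    }
    ```

    Written this way, a short input with aEscape set to false comes back unchanged instead of as an empty string.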

  • Foo AKA Fooo (unregistered) in reply to AStoryFromTheBattleFront

    Not sure how it's relevant here, but yes: In MySQL, "utf8mb4" means UTF-8. "utf8" is an alias for "utf8mb3" (max. 3-byte sequences, i.e. BMP only, i.e. no emoji). At least it's deprecated and planned to mean "utf8mb4" in the future. But for now, just use "utf8mb4".

  • (nodebb)

    I believe "Escape" is the result of the original author not being a native speaker of English.

    When you escape a string you modify it so that certain forbidden characters won't cause problems. This function takes certain extended characters and reduces them down to the normal ASCII set--sanitizing a string for some purpose. While we would be unlikely to call such a function "Escape" it feels like an edge case an ESL person could easily do and nowhere near WTF level.

  • MaxiTB (unregistered)

    huh? The a prefix is something that our client does as well, where I am luckily not there as a dev.

    However, this prefix has nothing to do with Hungarian notation. Just because people use camel case doesn't make it automatically Hungarian (which signals type and scope but there's no a for method arguments but a_).

  • MaxiTB (unregistered) in reply to Apu

    It's a matter of debate; in Austria back in the 90s your OeLWANNE was actually the correct way of doing it IIRC - it looks weird, but eh - that was ÖNORM for you.

  • MaxiTB (unregistered) in reply to Foo AKA Fooo

    Your code is incorrect. const is only allowed as a field member modifier in C# for literals. You can only have variables at scope level, which end up as reserved stack space.

  • MaxiTB (unregistered) in reply to LorenPechtel

    The term escape comes from the 80s/90s, when escaping was used as a general term for converting a string to a printable version of ASCII or EBCDIC (yeah, we are talking code page times, long before Unicode was even specified). So on one hand the dev who wrote this graduated before the 2000s and never adapted his naming preferences. On the other hand someone from this era would know that the correct Hungarian notation for aEscape would actually be a_bEscape, so maybe it's a Java developer. Always a good bet too.

  • guest (unregistered) in reply to Anonymous') OR 1=1; DROP TABLE wtf; --

    Being pedantic, only replacing ä with ae when it's a single codepoint is correct, as ä is considered a single, unmodified letter of a 30 letter alphabet in German, which has the ae alternative rule. If you get ä with a combining diacritic, you are speaking a different language where the ae rule does not apply, like ê in French which truly is an e with an added accent.

    Not that the original author would have thought about that.

    (Pet peeve: That means "Umlaut" does not refer to the two dots above the letter, but to the whole standalone letter itself)

  • Vilx- (unregistered)

    Three hardest problems in programming are: text, numbers and time.

  • MaxiTB (unregistered) in reply to Vilx-

    Actually true, false and filenotfound.

  • MaxiTB (unregistered) in reply to Anonymous') OR 1=1; DROP TABLE wtf; --

    .NET UTF-8 encoding doesn't support combining UTF characters correctly (like most other software); in other words they result in two different strings with the same output, which may not be an issue in other languages, but in .NET strings are guaranteed to be interned on comparison. So in other words a compare results in false for the same thing.

    Considering they are so niche, I guess that is not a huge problem (I can't think of a single reason why you would ever use them in business logic, similar to not using UTC for date/times), so it's more a question of converting them at input (and possibly on output when required) than dealing with them everywhere else.

  • NoLand (unregistered)

    Regarding "ä" as a single code point versus two code points: The umlaut is a genuine character, equivalent to a union of the base character and an ending "e" (ä = ae), while two code points signify a trema, (also diaeresis) indicating that the vowel is not part of a digraph, but rather starts a new syllable. Meaning, as an umlaut, "ü" may be transformed to "ue", while as a diaeresis in "aü" it indicates that this is not to be pronounced as the digraph "au", but as distinct vowels, where a first syllable ends in "a" and "u" starts a new syllable.

    As for the name, German – English dictionaries actually provide "unify" as a translation for "vereinheitlichen" (to harmonize); you may have learned this in school.

  • (nodebb)

    The codebase is full of stuff like this, especially ill-named and illogical stuff. I already came close to sending in code several times, but this high-density bug-fest (which is solely running on pure luck and ignorance) pushed me over the edge.

    Working on an industrial Fortran code base. I came close to sending the whole code base on many occasions, but I am afraid of legal repercussions and of giving the reader an aneurysm.

  • (nodebb) in reply to Anonymous') OR 1=1; DROP TABLE wtf; --

    Whoever uses the debi... sorry, denormalized version deserves what they get.

  • can't think of any more stupid names (unregistered) in reply to Tim

    We were once greeted with a global email from one of the offshore developers who was delighted with himself for having developed a method for validating some parameter in the banking industry -- I completely forget which. It was one of those things, like an ISBN, which includes redundant check bytes so as to provide a modicum of transmission error checking.

    I took a look at it, as suggested, and compared it against the spec, available on Wikipedia, to check whether his algorithm was in fact valid. It was badly non-compliant. I sent an email back, appropriately polite and deferential, suggesting that certain codes were not being checked correctly, and there were several cases where a false positive or a false negative would be returned.

    He replied hotly, declaring that I didn't know what I was talking about, because his function passed all (3) of his test cases in his unit tests, and perhaps I should go back to school and do a basic programming course.

    I had no dog in this particular race, so I let him get on with it.

  • RLB (unregistered) in reply to guest

    (Pet peeve: That means "Umlaut" does not refer to the two dots above the letter, but to the whole standalone letter itself)

    Extra pedantically: Umlaut properly means the process by which certain vowels are, regularly or semi-regularly, changed into other vowels in some languages, most typically in the Germanic ones. The letter is properly called an Umlautvokal and the dots an Umlautzeichen.

  • I'm not a robot (unregistered) in reply to guest

    That's not how it works. The choice of whether to use combining or precomposed characters doesn't have anything to do with whether any particular language considers the result to be a distinct letter or not.

