• (nodebb)

    I like how converting from one encoding to a different encoding also requires adding a newline to the end of the string.

  • (nodebb)

    P-O did find that converting this function to a no-op had no impact on the application.

    That seems ... unlikely ... unless the application was targeted to take into account only data written in 7-bit clean ASCII, seeing as how "é" in ISO-8859-1 becomes "Ã©" if you blindly encode it as UTF-8 and then don't interpret it religiously as UTF-8. (And "é" - after the noopification of the function - isn't going to be valid UTF-8 if you feed it to something expecting UTF-8).

    And, it's also unlikely to have no impact if something depended on the newline being there...
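
To make the two failure modes in the comment above concrete, here is a minimal sketch. It is in Python purely for illustration (not the language of the code in the article), and both function names are made up for this example.

```python
def utf8_bytes_read_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) read the bytes back as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: the conversion ran, but the consumer still assumes ISO-8859-1.
mojibake = utf8_bytes_read_as_latin1("\u00e9")  # U+00E9 is "é"
print(repr(mojibake))  # two characters: U+00C3 followed by U+00A9

# Case 2: the conversion was no-op'ed, but the consumer expects UTF-8.
latin1_raw = "\u00e9".encode("iso-8859-1")  # the single byte 0xE9
print(is_valid_utf8(latin1_raw))
```

Either way, only input that stays within 7-bit ASCII comes through unchanged.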

  • TheCPUWizard (unregistered) in reply to Steve_The_Cynic

    @Steve === Yup, having lots of fun with that... dealing with a program where "strings" are stored as arrays of bytes.. different strings have to be interpreted differently to make them strings... so while encoding is typically done at the file level (ideally with a BOM) this can not be (easily) refactored as there are no editors which support having ANSI [MBCS], UTF-8 and UTF-16 in the same file....

  • (nodebb)

    Sounds to me like a case of premature internationalization. One dev who thought they understood i18n decided to start doing stuff the "modern" way. They quickly lost interest because management and all the coworkers were totally US-centric. Good thing the guy gave up before he filled the codebase with even more exciting confusion. Or is that confused excitement?

  • Dan (unregistered)

    "ISO-8859 is, notably, not UTF-16, and in fact, isn't 16 bits at all- it's just a different 8-bit encoding from UTF-8."

    I agree, but not completely: ISO-8859 is not just a different encoding from UTF-8. ISO-8859 is a series of 8-bit encodings, none of which is a 16-bit encoding, let alone the particular 16-bit encoding known as UTF-16.

    Ranked by likelihood of WTF, character encoding comes in second (behind time zones, but we knew that).
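
The "series of 8-bit encodings" point can be seen by decoding one and the same byte with several parts of ISO-8859. This is only a sketch in Python; `decode_with_each` is a hypothetical helper, and the three parts chosen are arbitrary.

```python
def decode_with_each(raw: bytes,
                     parts=("iso-8859-1", "iso-8859-5", "iso-8859-7")) -> dict:
    """Decode the same bytes under several ISO-8859 parts."""
    return {name: raw.decode(name) for name in parts}


# One byte, three different characters depending on which part is assumed.
decoded = decode_with_each(b"\xe9")
for name, ch in decoded.items():
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

So "ISO-8859" on its own does not say which character a byte above 0x7F stands for; the part number has to be known too.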

  • (nodebb) in reply to WTFGuy

    "confused excrement" surely?

  • (nodebb)

    Wouldn't surprise me if it was originally written to encode as UTF-16, and later it was decided/realized that they needed ISO-8859 instead, and changed it here instead of doing it right and documenting it. The note that explains the use of ISO-8859 probably wasn't entered when the change was made, but was added by someone who came through after the fact and didn't know the change had ever been made.

  • Brian Boorman (unregistered) in reply to Steve_The_Cynic

    That seems ... unlikely ... unless the application was targeted

    Or unless the function is never actually called anywhere else in the code base and OP just happened upon this while looking for something else.

  • Abigail (unregistered)

    UTF-8 is not an 8-bit encoding. It encodes some characters in 8 bits, but most characters require two, three, or four bytes. That it cannot be an 8-bit encoding is obvious once you realize UTF-8 can encode every Unicode character, and there are a lot more than 256 of them.

    ISO-8859 is 8 bits: every character is encoded in 8 bits. It doesn't claim to be able to encode more than 256 different characters.
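
A quick way to check the byte counts claimed above, sketched in Python; the sample characters and the helper name `utf8_byte_count` are arbitrary choices for this example.

```python
# One sample from each UTF-8 length class: "A", "é", "€", and an emoji.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_byte_count(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} -> {utf8_byte_count(ch)} byte(s) in UTF-8")

# ISO-8859-1 spends exactly one byte on every character it can represent.
latin1_len = len("\u00e9".encode("iso-8859-1"))
print(f"U+00E9 -> {latin1_len} byte in ISO-8859-1")
```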

  • (nodebb)

    I'm hoping that the alteration to the function having no impact was because the function was never called. That would be (or at least lead to) a happy outcome.

  • Officer Johnny Holzkopf (unregistered) in reply to Dan

    ISO-8859-1 indicates Western European encoding (8 bit), so things like é, Ö, µ or ß would be part of it. However, of course not all valid UTF-8 could be converted to a matching 1:1 counterpart in ISO-8859-1. Maybe the original coder (or their mangler) did not understand that you cannot simply exchange one for the other? Rule: "One does not simply convert UTF-8 to ISO-8859-1."
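
The rule can be shown with a small Python sketch. `to_latin1` is a hypothetical helper, not the function from the article; it simply decodes UTF-8 and re-encodes as ISO-8859-1 using Python's standard codec error handlers.

```python
def to_latin1(utf8_data: bytes, errors: str = "strict") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1 bytes."""
    return utf8_data.decode("utf-8").encode("iso-8859-1", errors=errors)


# Characters that exist in ISO-8859-1 ("é Ö µ ß") convert without loss.
western = to_latin1("\u00e9 \u00d6 \u00b5 \u00df".encode("utf-8"))
print(western)

# The euro sign is valid UTF-8 but has no slot in ISO-8859-1.
euro = "\u20ac".encode("utf-8")
try:
    to_latin1(euro)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict conversion failed:", strict_failed)

# With errors="replace" the conversion "works" by silently losing data.
lossy = to_latin1(euro, errors="replace")
print(lossy)
```

Anything outside the 256 characters of ISO-8859-1 either raises an error or gets replaced, so the conversion is only safe when the input is known to stay within that set.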

  • (nodebb)

    I'm inclined to agree with Llarry: this is a bit of legacy code that does nothing these days because it recognizes that it's already converted and NOPs. It might break if it got its hands on some archaic data.

  • löchlein deluxe (unregistered) in reply to Steve_The_Cynic

    My money's on "at that point in processing, data that is not 7bit clean would still be a base64-encoded string".
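
If that theory holds, a no-op replacement would indeed be invisible. A rough Python sketch of why, assuming (and this is only an assumption) that the conversion went from ISO-8859-1 to UTF-8; `reencoding_is_noop` is a name invented for the example.

```python
import base64


def reencoding_is_noop(raw: bytes) -> bool:
    """True if reading raw as ISO-8859-1 and writing it as UTF-8 changes nothing."""
    return raw.decode("iso-8859-1").encode("utf-8") == raw


payload = "Jos\u00e9".encode("utf-8")  # contains bytes above 0x7F
wrapped = base64.b64encode(payload)    # base64 output is plain ASCII

print(all(b < 0x80 for b in wrapped))  # every byte is 7-bit clean
print(reencoding_is_noop(wrapped))     # conversion leaves it untouched
print(reencoding_is_noop(payload))     # the raw payload would be altered
```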

  • (nodebb) in reply to Abigail

    UTF-8 uses 8-bit code units, in sequences of one to four of them, to represent Unicode code points, which is why it is regarded as an "8-bit encoding" distinct from the likes of UTF-16 which uses 16-bit code units (which is not the same as a sequence of two 8-bit units in UTF-8) to represent Unicode CPs.
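
The code-unit distinction can be counted directly. A small Python sketch, with `code_units` as a hypothetical helper; UTF-16 is encoded little-endian here only to avoid counting a BOM.

```python
from typing import Tuple


def code_units(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 code units, UTF-16 code units) for one code point."""
    utf8_units = len(ch.encode("utf-8"))            # one unit = 8 bits
    utf16_units = len(ch.encode("utf-16-le")) // 2  # one unit = 16 bits
    return utf8_units, utf16_units


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    u8, u16 = code_units(ch)
    print(f"U+{ord(ch):04X}: {u8} UTF-8 unit(s), {u16} UTF-16 unit(s)")

# One 16-bit unit is not the same thing as two UTF-8 bytes:
print("\u00e9".encode("utf-8"), "\u00e9".encode("utf-16-le"))
```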

  • (nodebb)

    convert LDAP strings

    I wouldn't be surprised if their LDAP strings are just 7-bit ASCII to begin with, and were never even ISO-8859, which would explain why the method can be replaced by a no-op.

    "LDAP strings" are often a misnomer for "usernames", and in large orgs it's just easier to use plain 7-bit ASCII because the lowest common denominator makes IT folks happier when dealing with a heterogeneous park of hardware. Especially if you're in an international/US-centric company.

  • John G. (unregistered)

    I'm surprised how often I see the results of this on sites and apps misrepresenting the "correct" characters or punctuation as garbage...

Leave a comment on “UTF-16 Encoding”
