Admin
I like how converting from one encoding to a different encoding also requires adding a newline to the end of the string.
Admin
That seems ... unlikely ... unless the application was targeted to take into account only data written in 7-bit clean ASCII, seeing as how "é" in ISO-8859-1 becomes "Ã©" if you blindly encode it as UTF-8 and then don't interpret it religiously as UTF-8. (And the single ISO-8859-1 byte for "é" - after the noopification of the function - isn't going to be valid UTF-8 if you feed it to something expecting UTF-8).
And, it's also unlikely to have no impact if something depended on the newline being there...
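For anyone who wants to see both failure modes, here is a minimal Python sketch. It is not the code from the article; the variable names are made up for illustration.

```python
# "\u00e9" is the character é; escapes are used so the snippet does not
# depend on how this page or your editor is encoded.
original = "\u00e9"

latin1_bytes = original.encode("iso-8859-1")  # one byte: 0xE9
utf8_bytes = original.encode("utf-8")         # two bytes: 0xC3 0xA9

# Failure mode 1: UTF-8 bytes read back as ISO-8859-1 give two characters.
mojibake = utf8_bytes.decode("iso-8859-1")    # "\u00c3\u00a9", i.e. Ã©

# Failure mode 2: the lone ISO-8859-1 byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
```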
Admin
@Steve === Yup, having lots of fun with that... dealing with a program where "strings" are stored as arrays of bytes... different strings have to be interpreted differently to make them strings... so while encoding is typically done at the file level (ideally with a BOM), this cannot be (easily) refactored, as there are no editors which support having ANSI [MBCS], UTF-8 and UTF-16 in the same file...
Admin
Sounds to me like a case of premature internationalization. One dev who thought they understood i18n decided to start doing stuff the "modern" way. They quickly lost interest because management and all the coworkers were totally US-centric. Good thing the guy gave up before he filled the codebase with even more exciting confusion. Or is that confused excitement?
Admin
"ISO-8859 is, notably, not UTF-16, and in fact, isn't 16 bits at all- it's just a different 8-bit encoding from UTF-8."
I agree, but not completely: ISO-8859 is not just a different encoding from UTF-8. ISO-8859 is a series of 8-bit encodings, none of which is a 16-bit encoding, let alone the particular 16-bit encoding known as UTF-16.
Ranked by likelihood of WTF, character encoding comes in second (behind time zones, but we knew that).
Admin
"confused excrement" surely?
Admin
Wouldn't surprise me if it was originally written to encode as UTF-16, and later it was decided/realized that they needed ISO-8859 instead, and changed it here instead of doing it right and documenting it. The note that explains the use of ISO-8859 probably wasn't entered when the change was made, but was someone who came through after the fact and didn't know the change had ever been made.
Admin
Or unless the function is never actually called anywhere else in the code base and OP just happened upon this while looking for something else.
Admin
UTF-8 is not an 8-bit encoding. It encodes some characters in 8 bits, but most characters require two, three or four 8-bit bytes. That it cannot be an 8-bit encoding is obvious once you realize UTF-8 can encode every Unicode character, and there are a lot more than 256 of them.
ISO-8859 is 8 bits: every character is encoded in 8 bits. It doesn't claim to be able to encode more than 256 different characters.
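A quick way to check the byte counts yourself, sketched in Python. The sample characters are arbitrary picks, not anything from the article.

```python
# One sample character from each UTF-8 length class (arbitrary choices).
samples = {
    "A": "A",                    # U+0041, plain ASCII
    "e-acute": "\u00e9",         # U+00E9
    "euro": "\u20ac",            # U+20AC
    "emoji": "\U0001F600",       # U+1F600, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# In ISO-8859-1, every character it can represent takes exactly one byte.
latin1_length = len(samples["e-acute"].encode("iso-8859-1"))
```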
Admin
I'm hoping that the alteration to the function having no impact was because the function was never called. That would be (or at least lead to) a happy outcome.
Admin
ISO-8859-1 indicates a Western European encoding (8-bit), so things like é, Ö, µ or ß would be part of it. However, of course not all valid UTF-8 could be converted to a matching 1:1 counterpart in ISO-8859-1. Maybe the original coder (or their mangler) did not understand that you cannot simply exchange one for the other? Rule: "One does not simply convert UTF-8 to ISO-8859-1."
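To illustrate the rule, a small Python sketch. The helper `to_latin1` is a hypothetical name for this example, not a function from the article.

```python
def to_latin1(text):
    """Return text as ISO-8859-1 bytes, or None if some character has no slot.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None


sharp_s = to_latin1("\u00df")  # ß is in ISO-8859-1: a single byte, 0xDF
euro = to_latin1("\u20ac")     # € is not in ISO-8859-1: no byte to map to

# The lossy alternative: unmappable characters silently become "?".
lossy_euro = "\u20ac".encode("iso-8859-1", errors="replace")
```

Whether to fail loudly or substitute a placeholder is a design choice, but either way the conversion is not a plain 1:1 swap.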
Admin
I'm inclined to agree with Llarry, this is a bit of legacy code that does nothing these days because it recognizes that it's already converted and NOPs. It might break if it got its hands on some archaic data.
Admin
My money's on "at that point in processing, data that is not 7-bit clean would still be a base64-encoded string".
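If that guess is right, the no-op would be harmless, because 7-bit clean text has the same bytes in ASCII, ISO-8859-1 and UTF-8. A minimal Python sketch, assuming a made-up sample value (the article does not show the real data):

```python
import base64

# Hypothetical non-ASCII value ("café"), base64-encoded as it might be
# while in transit. The sample is an assumption for illustration.
raw = "caf\u00e9".encode("utf-8")
b64_text = base64.b64encode(raw).decode("ascii")

# Base64 output only uses ASCII characters, so it is 7-bit clean.
is_seven_bit_clean = b64_text.isascii()

# For 7-bit clean text the three encodings agree byte for byte,
# so "converting" between them changes nothing.
same_bytes = (
    b64_text.encode("ascii")
    == b64_text.encode("iso-8859-1")
    == b64_text.encode("utf-8")
)
```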
Admin
UTF-8 uses 8-bit code units, in sequences of one to four of them, to represent Unicode code points, which is why it is regarded as an "8-bit encoding" distinct from the likes of UTF-16 which uses 16-bit code units (which is not the same as a sequence of two 8-bit units in UTF-8) to represent Unicode CPs.
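The code unit distinction can be checked in a few lines of Python. The helper names below are invented for this sketch.

```python
def utf8_code_units(ch):
    """Count 8-bit code units (bytes) in the UTF-8 form of ch."""
    return len(ch.encode("utf-8"))


def utf16_code_units(ch):
    """Count 16-bit code units in the UTF-16 form of ch (no BOM)."""
    return len(ch.encode("utf-16-le")) // 2


e_acute = "\u00e9"    # U+00E9
emoji = "\U0001F600"  # U+1F600, needs a surrogate pair in UTF-16

units = {
    "e_acute": (utf8_code_units(e_acute), utf16_code_units(e_acute)),
    "emoji": (utf8_code_units(emoji), utf16_code_units(emoji)),
}

# Two UTF-8 bytes are not the same thing as one UTF-16 unit:
utf8_bytes = e_acute.encode("utf-8")       # C3 A9
utf16_bytes = e_acute.encode("utf-16-le")  # E9 00
```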
Admin
I'd not be surprised if their LDAP strings are just 7-bit ASCII to begin with, and never even ISO-8859, which would explain why the method can be replaced by a no-op.
"LDAP strings" are often a misnomer for "usernames", and in large orgs it's easier to stick to 7-bit ASCII, because the lowest common denominator makes IT folks happier when dealing with a heterogeneous park of hardware. Especially if you're in an international/US-centric company.