The Daily WTF: Curious Perversions in Information Technology

2007-03-29

anne:
Has anyone noticed how many of these involve string processing? I suppose string processing is kind of annoying sometimes, but why does it produce so many glorious WTFs?
The infinite loop is teh awesome. DoS, anybody?

Sorry, forgot to quote

2007-03-29

AdT:
Nonsense! What he was trying to do is converting the HTML entities for single and double quote to ordinary characters. What do ' and " in JavaScript have to do with ASCII?

There's no such thing as an "ordinary character". Every binary representation of a "character", which should be referred to as a "glyph" in this context, is the binary representation of that glyph in a specific character set. If you go back far enough, you get IBM character sets that were 5-bit. These evolved into the 7-bit US-ASCII character set. Effectively every other character set in wide use today uses the 128 characters specified by US-ASCII as the "first" 128 characters (that is, binary 00000000 through 01111111 as the least significant byte, with all other bytes being either 00000000 or the extension byte variant thereof, depending on the character set).

US-ASCII is the only character set allowed by many original versions of most core internet protocols for commands, and often data, too (e.g. SMTP). The most commonly used character set for English operating systems and applications is ISO-8859-1 (a.k.a. Latin-1), which is an 8-bit character set mirroring US-ASCII for all values below 128, and providing 128 commonly-used (in English and Latin-based cultures) glyphs with values between 128 and 255.

Now, you do raise a good point: JavaScript may define, as its allowed character set, one of the Unicode Transformation Format (UTF) character sets, rather than ASCII. I don't know for sure. If so, I was incorrect in saying that the author of the code was trying to convert these HTML entities into ASCII. However, the fact remains that he was not trying to convert anything to or from Unicode. Moreover, I reiterate that there is no such thing as an "ordinary character".

2007-03-29

I would like to celebrate this very first post in a long, long time without an unclever captcha comment to be found anywhere. Well done. I can actually just enjoy the comments.

2007-03-29

Michael Houghton:
'How did their data get “&#39;” in the first place?'
TinyMCE does this in certain circumstances. At least, I find myself having to clean them up, for reasons I am not clear on. It would certainly explain why there was a need to process them in javascript.

Ouch!! Why did you have to say "TinyMCE"? Seriously, the code cited in the article is rather benign: it will remain lost in obscurity in whatever little application it's written for. TinyMCE, OTOH, is in the wild, converting and reconverting until it has turned users' intent into garbage wherever it goes.

2007-03-29

Anonymous Tart:
Been there done that don't care any more:
I found something like this once (but with variables - that were set far in advance of the call - instead of literals). The only thing I can figure is that it used to do the replacements, and at some point, they became unnecessary. The subsequent coder didn't realize they were just undoing a previous replace, and the whole thing could have been eliminated, so they just tacked on additional calls to replace to "fix" the broken string.
String wtf = someString.replace("A","B").replace("C","D").replace("B","A").replace("D","C");
Yeah ... dont maintain any of my code thx. Complete noop innit....
>>> var a = "BBAA"; var b = "AABB";
>>> a.replace("A","B").replace("C","D").replace("B","A").replace("D","C")
"ABBA"
>>> b.replace("A","B").replace("C","D").replace("B","A").replace("D","C")
"AABB"

String does not have a replace(String, String) method. Your example uses the replaceFirst method. If the original poster meant to use replace(char, char) then both examples would return the value of "AAAA".
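
For the JavaScript console session quoted above, the difference between first-occurrence and replace-every-occurrence semantics can be sketched as follows. The function names chainFirst and chainAll are made up for illustration, and the /g regular expressions stand in for a per-character replace-all such as the replace(char, char) mentioned above.

```javascript
// String patterns: each call replaces only the first occurrence.
function chainFirst(s) {
    return s.replace("A", "B").replace("C", "D").replace("B", "A").replace("D", "C");
}

// Global regular expressions: each call replaces every occurrence.
function chainAll(s) {
    return s.replace(/A/g, "B").replace(/C/g, "D").replace(/B/g, "A").replace(/D/g, "C");
}

console.log(chainFirst("BBAA"), chainFirst("AABB")); // ABBA AABB
console.log(chainAll("BBAA"), chainAll("AABB"));     // AAAA AAAA
```

Either way the chain is not a no-op in general: with first-occurrence semantics it can reorder characters, and with replace-all semantics it collapses every B into an A.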

2007-03-29

Mystified:
There's no such thing as an "ordinary character".

Yes, there is - in JavaScript aka ECMAScript, a character is simply a string of length one. What we call ' here is the single quote character referred to in the ECMAScript standard.

Mystified:
Now, you do raise a good point: JavaScript may define, as its allowed character set, one of the Unicode Transformation Format (UTF) character sets, rather than ASCII. I don't know for sure.

ECMAScript requires all implementations to conform to Unicode as far as this is applicable. However, what character set or encoding an implementation uses internally is completely irrelevant.

Mystified:
However, the fact remains that he was not trying to convert anything to or from Unicode.

Actually, since ECMAScript requires implementations to be Unicode conformant, and it's easy to convert JavaScript characters to and from UCS codes, one could argue that he is converting these entities to UCS. Though that is technically not identical with Unicode, UCS is the character set defined by the Unicode Consortium and even the ECMAScript standard refers to UCS as the "Unicode character set" (technically it's the Universal Character Set).
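
A minimal sketch of that round trip, using only the standard charCodeAt and String.fromCharCode methods. The helper names toNumericEntity and fromCode are hypothetical, and the sketch only looks at the first 16-bit code unit of its input.

```javascript
// Character -> decimal numeric entity, via the character's code value.
function toNumericEntity(ch) {
    return "&#" + ch.charCodeAt(0) + ";";
}

// Code value -> one-character string.
function fromCode(code) {
    return String.fromCharCode(code);
}

console.log(toNumericEntity("'"));  // &#39;
console.log(toNumericEntity('"'));  // &#34;
console.log(fromCode(39) === "'");  // true
```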

Mystified:
UTF-X are also *NOT UNICODE*. They are character sets which provide specifications for the binary representations of glyphs contained in the Unicode canon.

Methinks you are yourself confused about the terms. UTF-X are not character sets, they are encodings. The Unicode standard defines sets of glyphs, a glyph being the thing you see in a browser, with some level of abstraction applied to it, i.e. Unicode considers ', ' and ' to be the same glyph in different typefaces. UCS as a character set defines code values for these glyphs. Various encodings like UTF-8 and UTF-16 typically define binary representations of those code values.

2007-03-29

h@x0r:
since this is javascript, you'd only be DoS'ing yourself...

Worse, you would be DoSing your visitors / clients.

2007-03-29

AdT:
*snip*

There is a conflict of terminology, here, exacerbated by ambiguous colloquial usage, which is confusing what would, likely, otherwise be agreement. My original objection was to the identification of "Unicode" as being the origin of a conversion. Misunderstanding, and the resultant misuse, of the term "Unicode" is a major peeve of mine, having spent over a year dealing with the vagaries and minefields of international email.

With regard to your statements, in large part I am in agreement. While it is more accurate, indeed, to use "character encoding" to refer to the Unicode Transformation Formats, given modern usage, the phrases "character encoding" and "character set" are effectively interchangeable, subject to context. I chose "character set" as the more universally understandable phrase.

Furthermore, after reading the portion of the ECMAScript specification regarding lexical grammar, it is clear to me that it does call for implementations to honor the full range of the Unicode glyph repertoire, but places no restrictions on the character encoding mechanism used to implement that repertoire. This is unusual among standards, and I'm delighted to see it in this case. I still, however, object to the phrase "ordinary character". While your meaning is now clear to me, I insist that such phrases only further encourage ingrained misunderstandings of the vital distinction between a "character" and a "glyph". Pedantic? I wish it were; unfortunately, so many people misapprehend the concept of Unicode that "standard" implementations thereof are, often, horribly flawed (consider even recent versions of the ubiquitous AspNetEmail and AspNetMime which treated the UTF-16 byte sequences of .NET Unicode strings as ASCII byte sequences, without any transcoding). If those who understand Unicode, among whom you seem to be, continue to use colloquial parlance, we'll only encourage common misunderstanding.
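
The kind of mistake described above (UTF-16 bytes handed on as if they were single-byte characters) can be reproduced in a few lines. This is an illustrative sketch only, not the code of any library named here; the little-endian byte order and the helper names utf16leBytes and misreadAsSingleBytes are assumptions.

```javascript
// Serialize a string as UTF-16LE: two bytes per 16-bit code unit, low byte first.
function utf16leBytes(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        bytes.push(unit & 0xff, unit >> 8);
    }
    return bytes;
}

// The mistake: treat every byte as one character, with no transcoding.
function misreadAsSingleBytes(bytes) {
    return bytes.map(function (b) { return String.fromCharCode(b); }).join("");
}

var garbled = misreadAsSingleBytes(utf16leBytes("Hi"));
console.log(garbled.length);       // 4, because a NUL follows each letter
console.log(JSON.stringify(garbled));
```

For plain ASCII text the damage shows up as a NUL after every letter; for anything outside ASCII the high byte becomes an unrelated character.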

Yes, I do believe accurate vocabulary is that vital. Just think back to your last requirements meeting.

P.S. I'm going to avoid any discussion of the ECMAScript specification's unfortunate (and inaccurate) backronym of "UCS".

2007-03-29

Mystified:
There's no such thing as an "ordinary character".

Wrong. Every character is "ordinary". Or are you an ASCII chauvinist? :-)

Every binary representation of a "character", which should be referred to as a "glyph" in this context,

Wrong again! A glyph is a particular graphical representation of a character (i.e. character + font == glyph, for suitable meaning of "+" of course)

Effectively every other character set in wide use today uses the 128 characters specified by US-ASCII as the "first" 128 characters (that is, binary 00000000 through 01111111 as the least significant byte, with all other bytes being either 00000000 or the extension byte variant thereof, depending on the character set).

Except the EBCDIC-based sets, which seem to still have a following. (That's a WTF! all of its own, but draws us away from the real point.)

The most commonly used character set for English operating systems and applications is ISO-8859-1 (a.k.a. Latin-1), which is an 8-bit character set mirroring US-ASCII for all values below 128, and providing 128 commonly-used (in English and Latin-based cultures) glyphs with values between 128 and 255.

8859-1 is very common on Unix systems in the Americas (Windows uses its own schemes) but is now uncommon in Europe, where 8859-15 has taken over.

I was incorrect in saying that the author of the code was trying to convert these HTML entities into ASCII.

At a guess, the original code author was trying to convert things into our favourite coding system, WTF-8!

However, the fact remains that he was not trying to convert anything to or from Unicode.

More seriously, since he was converting things into the character domain, he was converting to Unicode (since Unicode does specify what the code-point for each character is; there are character sets that do not, but they're not used with computers). That would then presumably get converted into a byte sequence at some point, probably using UTF-8, but there are a few other schemes that are less common (because they're more troublesome).

Old Wolf · 2007-03-29

Mystified:
Let's all, please, divorce ourselves of the idea that Unicode is a character set

Seeing as Unicode is in fact a character set, that would be a silly course of action.

Unicode is an attempt at a canonical set of all written glyphs, no matter what language or languages they occur in, with a unique value assigned to each.

That is the precise definition of 'character set'. A set of glyphs with a unique value assigned to each. For example, 'A' is assigned the value (a.k.a. 'code point') 65 in the ASCII character set, and also 65 in the Unicode character set. But it is assigned a different value in the EBCDIC character set.

A character set is a specification of binary representations of a set of glyphs.

Rubbish. For example, ASCII does not specify anything about how 'A' is to be represented, other than by the value 65. If you used ASCII on a ternary computer then it would not have a binary representation.

UTF-X are also *NOT UNICODE*. They are character sets which provide specifications for the binary representations of glyphs contained in the Unicode canon.

UTF-x are character encoding schemes, which specify how you can use a sequence of octets to encode the value of a character in the Unicode character set.
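
As a rough illustration of "a sequence of octets per character value", the sketch below prints the UTF-8 octets for three characters. It assumes a runtime that provides the standard TextEncoder API (current browsers and Node.js), which is newer than this thread; the helper name utf8Octets is made up.

```javascript
// Return the UTF-8 octets for a string as a plain array of numbers.
function utf8Octets(str) {
    return Array.from(new TextEncoder().encode(str));
}

console.log(utf8Octets("A"));       // [ 65 ]            U+0041, one octet
console.log(utf8Octets("\u00e9"));  // [ 195, 169 ]      U+00E9, two octets
console.log(utf8Octets("\u20ac"));  // [ 226, 130, 172 ] U+20AC, three octets
```

The code value of each character stays the same whichever encoding scheme is used; only the octet sequence differs.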

What this guy was trying to do was convert the *HTML representation* of the Unicode value of a given glyph to an *ASCII character set* representation of the same glyph.

ASCII has nothing to do with this, the objective would be just the same on an EBCDIC system or a Unicode-like system (eg. Java).

Why is Unicode so hard?

Because of people like you who have no idea what you are talking about, but try to sound authoritative.

2007-03-29

Whatever happened to the Replace function?

function convertSingleQuoteAndDoubleQuoteAsciiToCharacters(str){ return Replace(Replace(str, "'", "'") , """, """); }

2007-03-29

Brendan:
Whatever happened to the Replace function?
function convertSingleQuoteAndDoubleQuoteAsciiToCharacters(str){ return Replace(Replace(str, "'", "'") , """, """); }

I presume you're talking about the JavaScript string object's replace function. Unfortunately, unless you want to get all regular expressions with that, it will only replace the first instance of the supplied string.

Javascript is a very fun language. I'd compare it to running full speed through a minefield with only one eye.

nwbrown · 2007-03-29

h@x0r:
since this is javascript, you'd only be DoS'ing yourself...

No, you would be DoS'ing every poor soul who visited your page with JavaScript enabled.

2007-03-30

it was not so long ago that javascript did not have a replace function and perhaps more importantly all searches for "javascript replace function" returned "javascript does not have a replace function", of course that was before google

2007-03-30

Actually, won't the strings alternate between 'foo' and &#39'#39;

Still an infinite loop though.

Mcoder · 2007-03-30

anne:
Has anyone noticed how many of these involve string processing? I suppose string processing is kind of annoying sometimes, but why does it produce so many glorious WTFs?

WTF-prone people probably know better than to deal with pointers...

2007-03-30

I ran into a problem with the JavaScript .replace function where it would only replace the first instance of the string so replacing OMG with LOL: OMG you suck OMG -> LOL you suck OMG

Although a few minutes searching at google brought up the solution.
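
For reference, the behaviour described above and one common fix can be sketched like this. The helper names escapeRegExp and replaceEvery are made up for illustration; the escaping step only matters when the search text contains regular-expression metacharacters.

```javascript
// Escape regex metacharacters so arbitrary text can be used as a pattern.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of `from` with `to` using a global regex.
// A function replacement keeps "$" sequences in `to` from being interpreted.
function replaceEvery(str, from, to) {
    return str.replace(new RegExp(escapeRegExp(from), "g"), function () {
        return to;
    });
}

console.log("OMG you suck OMG".replace("OMG", "LOL"));      // LOL you suck OMG
console.log(replaceEvery("OMG you suck OMG", "OMG", "LOL")); // LOL you suck LOL
```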

2007-03-30

sorry, wrong language, I thought this was vb

MaGnA · 2007-03-30

Stevenovitch:
Brendan:
Whatever happened to the Replace function?
function convertSingleQuoteAndDoubleQuoteAsciiToCharacters(str){ return Replace(Replace(str, "'", "'") , """, """); }

I presume you're talking about the JavaScript string object's replace function. Unfortunately, unless you want to get all regular expressions with that, it will only replace the first instance of the supplied string.

Javascript is a very fun language. I'd compare it to running full speed through a minefield with only one eye.

Some fun:

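function convertSingleQuoteAndDoubleQuoteAndAnythingElseAsciiToCharacters(str)
{
    // Replace every decimal numeric entity (e.g. &#39;) with the character it encodes.
    return str.replace(/&#(\d+);/g, function (match, code) {
        return String.fromCharCode(parseInt(code, 10));
    });
}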

2007-03-30

Psst: (if ASP.NET) Server.HtmlEncode()

2007-03-30

nwbrown:
h@x0r:
since this is javascript, you'd only be DoS'ing yourself...

No, you would be DoS'ing every poor soul who visited your page with JavaScript enabled.

You missed the context of my comment, I was replying to a person that thought you could DoS them by visiting their site and causing an infinite loop. I was not commenting in the context of the app creators.

2007-03-30

Well, I had to make a similar function this week to fix the "beautified" quotes and apostrophes inserted by Wordpress when converting a post to HTML.

The name of the function? fixWPgayness().

2007-03-30

I donno. This is buggy obviously. The infinite loop should have been tested and found when they wrote the code since the only reason those loops are there is to handle the exact case they don't work for.

And of course they should use built in functions rather than reimplementing them since they obviously failed to reimplement them properly.

But this doesn't exactly make me cringe like many wtfs. It is fairly straightforward and easy to read. I can tell what it is doing and why and it is easy to see how it is broken. It is a reasonable, albeit inefficient way to do the task at hand if one didn't know of the 'replace' method. But an O(n*m) function isn't really that bad when n is typically less than 10 and m is typically 1 or 2. (n being the length of the string and m being the number of replacements +1)
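
To make the failure mode concrete, here is a hypothetical loop-based replacement; it is a guess at the general shape of such code, not the article's actual function. The names replaceAllNaive and replaceAllOnce and the maxPasses guard are illustrative only. The loop never finishes on its own when the replacement text contains the search text, which is why the sketch gives up after a fixed number of passes.

```javascript
// Hypothetical loop: keep replacing while the search text is still present.
function replaceAllNaive(str, from, to, maxPasses) {
    var passes = 0;
    while (str.indexOf(from) !== -1) {
        passes += 1;
        if (passes > maxPasses) {
            throw new Error("gave up after " + maxPasses + " passes");
        }
        str = str.replace(from, to); // string pattern: first occurrence only
    }
    return str;
}

// Single-pass alternative: split on the search text, then rejoin.
function replaceAllOnce(str, from, to) {
    return str.split(from).join(to);
}

console.log(replaceAllNaive("a&#39;b&#39;", "&#39;", "'", 100)); // a'b'
console.log(replaceAllOnce("aXb", "X", "XX"));                   // aXXb
```

Calling replaceAllNaive("aXb", "X", "XX", 100) throws, because every pass leaves at least one X behind; the split/join version visits each occurrence exactly once and so cannot loop.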

2007-03-31

Ahh, the classical confusion of input and output variables. The WTF is always that both exist. There is no trouble using

modify(String str) { .. return str; }

although it may be more understandable to use

modify(const String inputString) { String outputString = inputString; .. return outputString; }

alas, not necessary in this case

2007-04-02

wasnotlong:
it was not so long ago that javascript did not have a replace function and perhaps more importantly all searches for "javascript replace function" returned "javascript does not have a replace function", of course that was before google

I probably just missed the <sarcasm> tags there, but just in case:

Method of the String object (see also replace as a method of location)

Replaces a specified substring with another.

Syntax

stringObj.replace(regexp, newText)

...

Source: Official Netscape JavaScript 1.2 Reference, 1998

BTW: JavaScript per se has not been updated since that time, as far as I can tell. The JavaScript engines in more recent browsers have been enhanced, though, to accommodate the functionality provided by the Document Object Model.
