Admin
Sorry, forgot to quote
Admin
There's no such thing as an "ordinary character". Every binary representation of a "character", which should be referred to as a "glyph" in this context, is the binary representation of that glyph in a specific character set. If you go back far enough, you get IBM character sets that were 5-bit. These evolved into the 7-bit US-ASCII character set. Effectively every other character set in wide use today uses the 128 characters specified by US-ASCII as the "first" 128 characters (that is, binary 00000000 through 01111111 as the least significant byte, with all other bytes being either 00000000 or the extension byte variant thereof, depending on the character set).
US-ASCII is the only character set allowed by many original versions of most core internet protocols for commands, and often data, too (e.g. SMTP). The most commonly used character set for English operating systems and applications is ISO-8859-1 (a.k.a. Latin-1), which is an 8-bit character set mirroring US-ASCII for all values below 128, and providing 128 commonly-used (in English and Latin-based cultures) glyphs with values between 128 and 255.
Now, you do raise a good point: JavaScript may define, as its allowed character set, one of the Unicode Transformation Format (UTF) character sets, rather than ASCII. I don't know for sure. If so, I was incorrect in saying that the author of the code was trying to convert these HTML entities into ASCII. However, the fact remains that he was not trying to convert anything to or from Unicode. Moreover, I reiterate that there is no such thing as an "ordinary character".
Admin
I would like to celebrate this very first post in a long, long time without an unclever captcha comment to be found anywhere. Well done. I can actually just enjoy the comments.
Admin
Ouch!! Why did you have to say "TinyMCE"? Seriously, the code cited in the article is rather benign--it will remain lost in obscurity in whatever little application it's written for. TinyMCE, OTOH, is in the wild, converting and reconverting until it has turned users' intent into garbage wherever it goes.
Admin
String does not have a replace(String, String) method. Your example uses the replaceFirst method. If the original poster meant to use replace(char, char) then both examples would return the value of "AAAA".
Admin
Yes, there is - in JavaScript aka ECMAScript, a character is simply a string of length one. What we call ' here is the single quote character referred to in the ECMAScript standard.
ECMAScript requires all implementations to conform to Unicode as far as this is applicable. However, what character set or encoding an implementation uses internally is completely irrelevant.
Actually, since ECMAScript requires implementations to be Unicode conformant, and it's easy to convert JavaScript characters to and from UCS codes, one could argue that he is converting these entities to UCS. Though that is technically not identical with Unicode, UCS is the character set defined by the Unicode Consortium and even the ECMAScript standard refers to UCS as the "Unicode character set" (technically it's the Universal Character Set).
Methinks you are yourself confused about the terms. UTF-X are not character sets, they are encodings. The Unicode standard defines sets of glyphs, a glyph being the thing you see in a browser, with some level of abstraction applied to it, i.e. Unicode considers ', ' and ' to be the same glyph in different typefaces. UCS as a character set defines code values for these glyphs. Various encodings like UTF-8 and UTF-16 typically define binary representations of those code values.
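For what it's worth, the distinction between a code value and its encoded bytes can be sketched in a few lines of JavaScript. This is only an illustration, assuming an engine that provides `TextEncoder` (modern browsers and Node.js); the sample characters and variable names are arbitrary choices, not anything from the article.

```javascript
// In JavaScript a "character" is just a string of length one.
const quote = "'";

// Character -> code value (a UTF-16 code unit) and back again.
const code = quote.charCodeAt(0);             // 39
const roundTrip = String.fromCharCode(code);  // "'"

// One code value can have different byte representations depending
// on the encoding. U+00E9 is a single code value...
const eAcute = "\u00E9";
const codeValue = eAcute.charCodeAt(0);       // 0xE9 (233)

// ...but it becomes two bytes when encoded as UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(eAcute)); // [0xC3, 0xA9]
```

The code value stays the same whatever the engine does internally; only the encoding step decides how many bytes come out.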
Admin
There is a conflict of terminology, here, exacerbated by ambiguous colloquial usage, which is confusing what would, likely, otherwise be agreement. My original objection was to the identification of "Unicode" as being the origin of a conversion. Misunderstanding, and the resultant misuse, of the term "Unicode" is a major peeve of mine, having spent over a year dealing with the vagaries and minefields of international email.
With regard to your statements, in large part I am in agreement. While it is more accurate, indeed, to use "character encoding" to refer to the Unicode Transformation Formats, given modern usage, the phrases "character encoding" and "character set" are effectively interchangeable, subject to context. I chose "character set" as the more universally understandable phrase.
Furthermore, after reading the portion of the ECMAScript specification regarding lexical grammar, it is clear to me that it does call for implementations to honor the full range of the Unicode glyph repertoire, but places no restrictions on the character encoding mechanism used to implement that repertoire. This is unusual among standards, and I'm delighted to see it in this case. I still, however, object to the phrase "ordinary character". While your meaning is now clear to me, I insist that such phrases only further encourage ingrained misunderstandings of the vital distinction between a "character" and a "glyph". Pedantic? I wish it were; unfortunately, so many people misapprehend the concept of Unicode that "standard" implementations thereof are, often, horribly flawed (consider even recent versions of the ubiquitous AspNetEmail and AspNetMime which treated the UTF-16 byte sequences of .NET Unicode strings as ASCII byte sequences, without any transcoding). If those who understand Unicode, among whom you seem to be, continue to use colloquial parlance, we'll only encourage common misunderstanding.
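The kind of mistake described above, reading UTF-16 bytes as though each byte were one ASCII character, can be sketched as follows. To be clear, this is a hypothetical reconstruction of the general bug, not the actual code of any library; the function names are invented for the illustration.

```javascript
// Serialize a string's UTF-16 code units as little-endian bytes.
function toUtf16leBytes(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes.push(unit & 0xff, unit >> 8); // low byte first, then high byte
  }
  return bytes;
}

// The mistake: treat every byte as if it were one ASCII character.
function misreadAsAscii(bytes) {
  return bytes.map((b) => String.fromCharCode(b)).join("");
}

// The transcoding step that was skipped: recombine each byte pair
// into one 16-bit code unit.
function decodeUtf16le(bytes) {
  let text = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    text += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return text;
}

const hiBytes = toUtf16leBytes("Hi");       // [0x48, 0x00, 0x69, 0x00]
const garbled = misreadAsAscii(hiBytes);    // "H\u0000i\u0000"
const restored = decodeUtf16le(hiBytes);    // "Hi"
```

Even pure-ASCII text comes out with a NUL after every letter, which is why the bug tends to show up immediately in email headers and bodies.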
Yes, I do believe accurate vocabulary is that vital. Just think back to your last requirements meeting.
P.S. I'm going to avoid any discussion of the ECMAScript specification's unfortunate (and inaccurate) backronym of "UCS".
Admin
Seeing as Unicode is in fact a character set, that would be a silly course of action.
That is the precise definition of 'character set'. A set of glyphs with a unique value assigned to each. For example, 'A' is assigned the value (a.k.a. 'code point') 65 in the ASCII character set, and also 65 in the Unicode character set. But it is assigned a different value in the EBCDIC character set.
Rubbish. For example, ASCII does not specify anything about how 'A' is to be represented, other than by the value 65. If you used ASCII on a ternary computer then it would not have a binary representation. UTF-x are character encoding schemes, which specify how you can use a sequence of octets to encode the value of a character in the Unicode character set. ASCII has nothing to do with this, the objective would be just the same on an EBCDIC system or a Unicode-like system (eg. Java). Because of people like you who have no idea what you are talking about, but try to sound authoritative.
Admin
Whatever happened to the Replace function?
function convertSingleQuoteAndDoubleQuoteAsciiToCharacters(str){ return Replace(Replace(str, "'", "'") , """, """); }
Admin
I presume you're talking about the JavaScript string object's replace function. Unfortunately, unless you want to get into regular expressions, it will only replace the first instance of the supplied string.
Javascript is a very fun language. I'd compare it to running full speed through a minefield with only one eye.
Admin
No, you would be DoS'ing every poor soul who visited your page with JavaScript enabled.
Admin
It was not so long ago that JavaScript did not have a replace function and, perhaps more importantly, all searches for "javascript replace function" returned "javascript does not have a replace function". Of course, that was before Google.
Admin
Actually won't the strings alternate between 'foo' and ''#39;
Still an infinite loop though.
Admin
I ran into a problem with the JavaScript .replace function where it would only replace the first instance of the string. So, replacing OMG with LOL: "OMG you suck OMG" -> "LOL you suck OMG".
A few minutes of searching on Google brought up the solution, though.
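The post doesn't say which solution turned up, so the following is only a guess at the usual candidates: a regular expression with the global flag, or the split/join idiom. The variable names are just for illustration.

```javascript
const text = "OMG you suck OMG";

// A plain string pattern replaces only the first match.
const firstOnly = text.replace("OMG", "LOL");        // "LOL you suck OMG"

// A regular expression with the g flag replaces every match.
const viaRegex = text.replace(/OMG/g, "LOL");        // "LOL you suck LOL"

// split/join does the same without having to escape regex metacharacters.
const viaSplitJoin = text.split("OMG").join("LOL");  // "LOL you suck LOL"
```

The split/join form is handy when the search string comes from user input, since nothing in it is interpreted as a pattern.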
Admin
Sorry, wrong language. I thought this was VB.
Admin
Some fun:
Admin
Psst: (if ASP.NET) Server.HtmlEncode()
Admin
You missed the context of my comment, I was replying to a person that thought you could DoS them by visiting their site and causing an infinite loop. I was not commenting in the context of the app creators.
Admin
Well, I had to make a similar function this week to fix the "beautified" quotes and apostrophes inserted by WordPress when converting a post to HTML.
The name of the function? fixWPgayness().
Admin
I donno. This is buggy obviously. The infinite loop should have been tested and found when they wrote the code since the only reason those loops are there is to handle the exact case they don't work for.
And of course they should use built in functions rather than reimplementing them since they obviously failed to reimplement them properly.
But this doesn't exactly make me cringe like many wtfs. It is fairly straightforward and easy to read. I can tell what it is doing and why and it is easy to see how it is broken. It is a reasonable, albeit inefficient way to do the task at hand if one didn't know of the 'replace' method. But an O(n*m) function isn't really that bad when n is typically less than 10 and m is typically 1 or 2. (n being the length of the string and m being the number of replacements +1)
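For anyone who really does want the loop instead of the built-in, here is a rough sketch of one that cannot spin forever. It is not the code from the article; `replaceAllByLoop` is a made-up name, and the key assumption is that the search resumes after each match rather than rescanning the text that was just inserted.

```javascript
// Hypothetical loop-based replacement; not the code from the article.
function replaceAllByLoop(text, search, replacement) {
  if (search === "") return text; // an empty search string would never advance
  let result = "";
  let from = 0;
  let at = text.indexOf(search, from);
  while (at !== -1) {
    result += text.slice(from, at) + replacement;
    from = at + search.length; // skip past the match; never rescan the replacement
    at = text.indexOf(search, from);
  }
  return result + text.slice(from);
}

// Terminates even when the replacement contains the search string.
const doubled = replaceAllByLoop("it's", "'", "''"); // "it''s"
```

A loop of the form "while the string still contains X, replace X" never ends when the replacement itself contains X, which is exactly the case the position-tracking version sidesteps.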
Admin
Ahh, the classical confusion of input and output variables. The WTF is always that both exist. There is no trouble using
modify(String str) { .. return str; }
although it may be more understandable to use
modify(const String inputString) { String outputString = inputString; .. return outputString; }
alas, not necessary in this case
Admin
I probably just missed the <sarcasm> tags there, but just in case:
Method of the String object (see also replace as a method of location)
Replaces a specified substring with another.
Syntax
stringObj.replace(regexp, newText)
...
Source: Official Netscape JavaScript 1.2 Reference, 1998
BTW: JavaScript per se has not been updated since that time, as far as I can tell. The JavaScript engines in more recent browsers have been enhanced, though, to accommodate the functionality provided by the Document Object Model.