The Daily WTF: Curious Perversions in Information Technology

no laughing matter · 2012-07-04

Cbuttius:
Tab is commonly used which is ok except that Microsoft Word seems to be the only text editor that allows you to find-replace using it

Wait what?!?

Cbuttius:
Microsoft Word seems to be the only text editor

OK, you clearly don't know what a text editor is!

(Hint: Notepad is not a text editor; as the name suggests, it's a pad for your notes; also Wordpad is not a text editor).

Just three examples of real text editors: UltraEdit; Programmer's File Editor; Notepad++.

A Google search for "Windows text editor" will come up with lots and lots of alternatives.

2012-07-04

rss is broken:
article breaks rss

Or RSS breaks article.

2012-07-04

Nagesh:
Also, at a more principled level, UTF-8 has no business deciding that any codepoints Unicode defines are unworthy of being encoded. Its mandate is to be able to express any imaginable sequence of codepoints, and it takes that reasonably seriously.

Well, not UTF-8 then, just Unicode. Whatever.

Also also, it's very easy to think up scenarios where it could open a pretty serious security problem if UTF-8 decoders began producing spaces from unwanted byte combinations, or even removing them silently. Imagine someone thinking themselves secure because they have verified that the UTF-8 input doesn't contain ".." anywhere, who then later converts the checked string to UTF-16, silently removing the spurious bytes in ".\x01./.\x01./.\x01./etc/passwd".

Oh, but that happens all the time today. You can trick such filters with percent-encoding, or other tricks. The solution is of course to actually test after it has been converted.
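A minimal Python sketch of "test after it has been converted". The helper names `strip_control_chars` and `is_safe_path` are hypothetical, and the lossy conversion step is the imagined scenario from the quote, not the behavior of any real UTF-8 decoder.

```python
def strip_control_chars(text: str) -> str:
    """Hypothetical lossy conversion: silently drops C0 control characters."""
    return "".join(ch for ch in text if ord(ch) >= 0x20)


def is_safe_path(path: str) -> bool:
    """Naive filter: reject any path that contains a '..' sequence."""
    return ".." not in path


raw = "/var/www/.\x01./.\x01./.\x01./etc/passwd"

# Checked before the conversion: the filter only sees ".\x01." and lets it pass.
checked_too_early = is_safe_path(raw)

# Checked after the conversion: the filter sees the "../../../" that will be used.
converted = strip_control_chars(raw)
checked_after = is_safe_path(converted)

print(checked_too_early, checked_after)
```

The same ordering applies to percent-encoding: decode first, then validate the decoded value that the rest of the program will actually use.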

HIBT?

Probably not.

lucidfox · 2012-07-05

Surely TRWTF is WCF.

2012-07-05

PleegWat:
I wouldn't expect a chr() character in C#, but I'd expect "\a" or "\009" works?

There is a Chr() in VB.NET, but it does not exist in C# (also .NET).

In C#, char is a 2-byte value that represents a UTF-16 code unit (which is a whole codepoint for anything in the Basic Multilingual Plane, BEL included).

To use a hexadecimal escape sequence, you need to put the u, followed by four hex digits:

"For whom the \u0007 tolls"

Or cast the codepoint to a char:

char myBel = (char)7;
string myString = String.Format("For whom the {0} tolls", myBel);
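For comparison, a small Python sketch (not C#) of the same idea: several spellings of BEL, and how many 16-bit UTF-16 code units a character occupies. The helper `utf16_code_units` is a hypothetical name introduced here for illustration.

```python
bel = "\a"  # Python has a dedicated escape for BEL

# All of these spell the same single character, U+0007.
assert bel == "\x07" == "\u0007" == chr(7)


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units(bel))           # BEL fits in one code unit
print(utf16_code_units("\U0001F514"))  # U+1F514 BELL needs a surrogate pair
```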

2012-07-07

UPDATE: Google Feedburner, our RSS feed host, apparently doesn't like BEL characters, so I removed it from the article in hopes that it will fix the broken feed.

Wait? They worked at Google?

2012-07-09

Gurth:
The BEL character was originally used to cause an audible beep or buzz on terminals
No, it was originally used to ring the physical bell on a teletype. That is, a low-tech version of this.

Ack, I suddenly feel old for being around when you used BEL to get the attention of the Sysop (or for them to get your attention). (Post-physical bell, but we got the point).

2012-07-09

Rootbeer:
The Real What TF is that they misused BEL as a delimiter when there's already an ASCII Unit Separator non-printable control character (0x1F) that fits the purpose exactly, right?

Now I want to use Emoji as delimiters.
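For reference, a minimal Python sketch of what the Unit Separator suggestion in the quote could look like, paired with the ASCII Record Separator (0x1E). The `encode` and `decode` helpers are hypothetical, and rejecting fields that contain a separator is an assumption of this sketch, not something the thread specifies.

```python
US = "\x1f"  # ASCII Unit Separator: between the fields of one record
RS = "\x1e"  # ASCII Record Separator: between records


def encode(records: list[list[str]]) -> str:
    """Join fields with US and records with RS, refusing ambiguous input."""
    for record in records:
        for field in record:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(record) for record in records)


def decode(blob: str) -> list[list[str]]:
    """Split a blob produced by encode() back into records and fields."""
    return [record.split(US) for record in blob.split(RS)]


# Commas, tabs, pipes and quotes need no escaping here.
rows = [["Smith, John", "a|b\tc"], ['say "hi"', ""]]
blob = encode(rows)
print(decode(blob) == rows)
```

This is still in-band signaling: it only works as long as the data never contains 0x1E or 0x1F, which is why the sketch rejects such fields instead of guessing.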

2012-07-09

A Gould:
Gurth:
The BEL character was originally used to cause an audible beep or buzz on terminals
No, it was originally used to ring the physical bell on a teletype. That is, a low-tech version of this.

Ack, I suddenly feel old for being around when you used BEL to get the attention of the Sysop (or for them to get your attention). (Post-physical bell, but we got the point).

WHADDAYA MEAN "POST"? I'M USING A TTY 33!

2012-07-12

Cbuttius:
Aargle Zymurgy:
Wow... this many comments and not a peep about using CSV?

I did earlier; you just weren't reading properly.

Comma really is a very poor choice of delimiter.

There are alternatives that are humanly readable and typeable but rarely used in data, e.g. ` (very infrequently required) and | (usually infrequent).

Tab is commonly used which is ok except that Microsoft Word seems to be the only text editor that allows you to find-replace using it (you put in ^t for it), which often leads me to copy-pasting text into blank Word documents just to "process" it into tab-separated before copying it back (to Excel or wherever).

ç is a good delimiter character; no one uses it. Yes, my native language has that abortion of a character; I cleanse all my text of it before serialization.

2012-07-15

qbolec:
We have our own home-grown NoSQL database at nk, and it uses spaces for separation which provides even more fun with escaping and unescaping.
We also use our own scheme of serialization of table rows for caching, which uses \xFF as separator, which is quite unlikely to happen anywhere in UTF-8 strings or numbers (and most of the columns are of one of these types), but in case it happens we escape it by doubling (\xFF becomes \xFF\xFF).

We thought for a minute about using \0 which also doesn't happen inside regular strings, but that would be problematic for almost every part of the stack.

I've never heard of the ASCII unit separator character, but it sounds elegant, if it doesn't occur in UTF-8 strings too often.

I've heard about guys who were implementing some MMORPG internet protocol that chose 37 (or some other odd byte) as a separator, which was found by experiment to be the least probable byte value. It was like 12 years ago, but I am still wondering WTF was wrong with them, their priorities, their architecture, their stats analysis, and their data distribution.

No no no. Prefix (or postfix) the individual elements with a variable-length encoded integer. That way (1) you don't need to scan in order to parse and (2) no quoting is necessary. In-band signaling was a bad idea before phreakers figured out how to break it. Magic bytes aren't even quaint or retro at this point, they're just broken. If you need a particular lexical order, then use 0x00 or 0xFF as delimiters and append multiple variable-length integers to the very end of your string, so you still don't need to quote magic bytes, or scan every byte in order to parse your data structure. This is faster and more robust than using delimiters.
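A minimal Python sketch of the length-prefix approach described above. The comment does not name a particular variable-length integer format, so unsigned LEB128 (the varint style used by Protocol Buffers) is assumed here; `encode_varint`, `decode_varint`, `pack` and `unpack` are hypothetical helper names.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, position just past the varint that starts at pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def pack(fields: list[bytes]) -> bytes:
    """Prefix every field with its length; field contents are never inspected."""
    return b"".join(encode_varint(len(f)) + f for f in fields)


def unpack(buf: bytes) -> list[bytes]:
    """Jump from length prefix to length prefix instead of scanning for delimiters."""
    fields = []
    pos = 0
    while pos < len(buf):
        length, pos = decode_varint(buf, pos)
        if pos + length > len(buf):
            raise ValueError("truncated field")
        fields.append(buf[pos:pos + length])
        pos += length
    return fields


fields = [b"plain", b"\xff\xff", b"", b"with\x00nul", b"x" * 300]
packed = pack(fields)
print(unpack(packed) == fields)
```

Because the parser never looks inside a field, bytes such as \xFF or \0 need no escaping, and empty fields round-trip without special cases.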

2012-07-16

Captcha:ideo:
ASCII is the perfect example of a standard that was designed with lots of features that are just unnecessary today (though I suppose they were used back then). The other example is HTTP (PUT, DELETE, OPTIONS, PATCH?).

Can't tell if serious or troll

For Whom the BEL Tolls
