The Daily WTF: Curious Perversions in Information Technology

Welbog · 2007-04-27 Reply Admin

If only we could get ridoff those guys who like to say "first".

2007-04-27 Reply Admin

owned them good!

2007-04-27 Reply Admin

owned them good!

Mcoder · 2007-04-27 Reply Admin

Hey, what a lovely manual "quotation" system! But I'll still keep using prepared statements. At least they don't corrupt my data :)

ParkinT · 2007-04-27 Reply Admin

"ridoff", is that related to "rip-off"?

WIldpeaks · 2007-04-27 Reply Admin

So the wtf is the lame joke ?

2007-04-27 Reply Admin

I don't really get it. Either that or I just look too much into it.

tiro · 2007-04-27 Reply Admin

Alex G.:
I don't really get it. Either that or I just look too much into it.

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

2007-04-27 Reply Admin

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

Except when you know the input shouldn't contain any of these characters. Usually the replacement is more heavyweight then. But there happen situations when the target system has magic characters which can't be quoted.

This entry is not a WTF without context. At worst it's a "could be done better" practice.

2007-04-27 Reply Admin

tiro:
. . . It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

It may be something of a tautology but the general case is not the specific case.

Sometimes it makes perfect sense to just eat the character or replace it with something benign.

It all depends on context.

2007-04-27 Reply Admin

Totally agree. The function does exactly what it says on the tin!

bonzombiekitty · 2007-04-27 Reply Admin

tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.

2007-04-27 Reply Admin

I don't get it ... Sorry, but I don't ... It may be funny or WTF, but it is far beyond my comprehension ... Perhaps when I pass a couple of Sun's Java Certification tests I may be able to laugh at this or to say WTF, but until then can someone please spell out what exactly is wrong with this peace of code ?

Licky Lindsay · 2007-04-27 Reply Admin

s:
Except when you know the input shouldn't contain any of these characters.

In which case the input should be totally rejected, not "sanitized".

TheJasper · 2007-04-27 Reply Admin

Alex G.:
I don't really get it. Either that or I just look too much into it.

Well, I think its just a little strange that the special characters gotten rid of are ' " and &. I would expect !@#$%^&*() or something to also be included. But then the submission says very little about what kinds of strings are being manipulated or for what purpose.

I certainly don't get the 'clever' flavor text surrounding the code. It just feels like someone trying to be funny and falling flat on his face. It certainly doesn't explain the situation, unless the situation was that someone wanted to get rid of some special characters and the submitter thought that was bad.

2007-04-27 Reply Admin

Regexp is almost always slower than straight-off character replace, even repeated 3 times. The original code is enormously more readable too.

Sure if you can afford the comfort of discarding a piece of data and reporting an error instead, go ahead.

TheJasper · 2007-04-27 Reply Admin

bonzombiekitty:
Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:
if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.

I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.

2007-04-27 Reply Admin

In which case the input should be totally rejected, not "sanitized".

There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

2007-04-27 Reply Admin

It is all about context. The guy who wrote it may have had a boss that said "I'm sick of those goddam quotes and ampersands, get ridoff em!". In that case, the function is fine!

Markp · 2007-04-27 Reply Admin

Steve:
tiro:
. . . It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.
It may be something of a tautology but the general case is not the specific case.
Sometimes it makes perfect sense to just eat the character or replace it with something benign.

It all depends on context.

Exactly, how do we know that the string isn't a ", ' or & delimited list of integers that the coder wants to generalize into a space-delimited list. That's obviously not likely, but man, we really can't comment on something as valid as this without any context.

2007-04-27 Reply Admin

loose:
Totally agree. The function does exactly what it says on the tin!

Agreed This is the first time I have felt that this was not WTF worthy.
I would write this code if I needed this functionality... although I'd never leave a typo in the comment, I'm too o.c. fourthat.

CAPTCHA: poindexter - a person who is intelligent but socially inept; a nerd...hehesnort

shadowman · 2007-04-27 Reply Admin

Yeah, I'm not sure what's wrong here. Assuming the programmer had a good reason to remove single quotes, double quotes, and ampersands, that seems like a perfectly reasonable way to do it. Sure, I'm sure there are other ways, even more efficient ways to do it -- but the code itself really isn't bad or buggy or wtf'ey.

So I'm guessing the problem lies in my assumption that there was a good reason to remove those things. But we really need more information here.

Independent · 2007-04-27 Reply Admin

I will really hate myself for what I'm about to say, but...

The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".

2007-04-27 Reply Admin

Licky Lindsay:
s:
Except when you know the input shouldn't contain any of these characters.

In which case the input should be totally rejected, not "sanitized".

What? It is this type of thought that keeps some systems from being extremely easy to use.

A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???

2007-04-27 Reply Admin

[quote user="TheJasper"] if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); } [/quote]

I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.[/quote]

A single-character search&replace is almost always faster than a regexp. A smart compiler in the above case will realize the first argument is a static string and statically compile the replacement code. A somewhat dumber compiler will assume the replacement expression -could- be a variable, and so will first pass the first argument through a regexp parser, then generate copy&replace code, then parse the value using said code, all during runtime.

People think stuff like eval() or regexp replaces is harmless because it gets magically changed during original compilation. But what about regex=someExternalUserInput; somestring.replaceAll(regex,target); ?

2007-04-27 Reply Admin

I had a similar experience @ work

we were pushing XML files to a server that was not under our control, and the software was not XML compliant...

following the request of the customer, we had to replace ">" by "greater than", "<" by "lower than", and so on in the text nodes...

that's the best way for sure, fuck the standards !

2007-04-27 Reply Admin

Independent:
I will really hate myself for what I'm about to say, but...
The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".

If the WTF is a poor wording of a comment to a valid piece of code, then this story is Worse Than Failure.

2007-04-27 Reply Admin

I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

akatherder · 2007-04-27 Reply Admin

If you're building a filename or directory in Windows you have to do "something" with the invalid characters. A space or underscore is probably your best bet.

Although the code still doesn't make much sense.

2007-04-27 Reply Admin

Dan:
Licky Lindsay:
s:
Except when you know the input shouldn't contain any of these characters.

In which case the input should be totally rejected, not "sanitized".

What? It is this type of thought that keeps some systems from being extremely easy to use.

A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???

Context is always the key. Sometimes sanitizing by dropping/replacing characters is not really a valid option. In those situations one should reject the input (if we're not going to escape it) If this was a password value, we would not want silently to turn a password of &'""''&&" into a bunch of spaces.

I agree this may not be best way to deal with special characters, or maybe not the most efficient. But without know what they are doing with these strings or why they decided these characters are bad, ti's hardly a wtf

2007-04-27 Reply Admin

bonzombiekitty:
tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.

A test run on my PC doing the same operation 10000000 times showed that the character replace outperforms the regexp by a factor of 10 to 1.

I would say the same about readability. The fact that you mistyped it in your post (missing a backslash) says something.

Thief^ · 2007-04-27 Reply Admin

Zen:
I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.

2007-04-27 Reply Admin

"cash&items".match(/shit/)

2007-04-27 Reply Admin

Stephen Cochran:
In which case the input should be totally rejected, not "sanitized".
There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

Agreed. As an example consider the many ways a phone number can be written down: 800 555 1234 800.555.1234 (800) 555-1234 800-555-1234 555-1234 555.1234

And the internal format the application expects... 800/555-1234

The answer in this particular case would be: Sanitize when possible, reject when not.

And the real WTF is ragging on people for misspelling in comments when the meaning and intention are perfectly clear. Maybe the guy is ESL and isn't familiar with English idioms.

For shame worsethanfailure.com, for shame.

2007-04-27 Reply Admin

Thief^:
Zen:
I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.
I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.

I've used a system that truncated the password to 8 characters. I was quite annoyed when after a year of using the system just fine, typing my 12-letter password with special characters at the end I learned my password was really a common dictionary word, all-lowercase. Still beats the hell of a system which required the password to contain at least 2 digits, at least 5 lowercase characters, no other characters than [0-9a-z] and be between 8 and 10 characters long, then refuse anything else.

A basic place for use of the OP code would be a simple search engine entry preprocessing. Remove the quotes after switching the "sentence mode" on, remove '&' as AND operator which is default, leave spaces because otherwise the remaining words would get concatenated.

2007-04-27 Reply Admin

I remember back in the bad old days when I used vb4, access and csv. Quotes were the bane of my life and no matter the amount of times we tried to filter them out, some function somewhere would let em through and screw up the whole system.

Og course we just told off the users - "How dare you use a quote in your item descriptions - its your fault the application lost all your data"

AssimilatedByBorg · 2007-04-27 Reply Admin

Stephen Cochran:
In which case the input should be totally rejected, not "sanitized".
There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

OTOH, what really annoys me is systems that "sanitize" input for no apparent reason. e.g., removes a hyphen from a hyphenated name, or lowercases all but the first letter in a name that has mixed capitalization.

I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...

2007-04-27 Reply Admin

WIldpeaks:
So the wtf is the lame joke ?

Either that or Derrick Pallas's lame choice of article. (You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)

2007-04-27 Reply Admin

I wouldn't be surprised if the compiler could even optimise it to the same thing.

I would be surprised. And impressed.

I'm guessing that the direct character replace version (the one featured) is faster, but I'd be willing to bet that if the regex were precompiled then the regex would be faster by a factor of about 3.

As though it mattered. This isn't the kind of site where we are supposed to be picking over people's code for being cycles slower! Let's see some bad code snippets. Give me exponential time solutions to log (or constant) time problems! :-) I can go to YouTube if I want to see people who can't spell.

Luke

akatherder · 2007-04-27 Reply Admin

The Sparc workstations at my University had the same password weakness where it only looked at the first 8 characters. I proclaimed my password un-hackable to my roommate. When he wasn't looking I copied my password. Then I told him to watch. I pasted my password then proceeded to slam on keys for about 5 full seconds. Lo and behold it worked when I hit enter.

2007-04-27 Reply Admin

(You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)

Nah. Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.

Luke

2007-04-27 Reply Admin

AssimilatedByBorg:
I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...

You are aware that capitalization rules are locale-specific? That means if you work with the same data in the USA, the Netherlands and Turkey, you run the risk of getting three different answers if you don't take care...

2007-04-27 Reply Admin

bonzombiekitty:
tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.

Why the last | ? I'm not very familiriazed with Java but Shouldn't it be: return a_sValue.replaceAll("[\\'|\\"|&]", " ");

CAPTCHA: Again "Slashbot"? I filled my quota of creativity with you last time.

travisowens · 2007-04-27 Reply Admin

For those who couldn't figure it out (in order of WTF-ness):

He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.
He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)
He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.
If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.

travisowens · 2007-04-27 Reply Admin

Oh I forgot the Javascript syntax for RegEx.Replace. A MAJOR mistake he made is that global replace is not enabled, so he replaces only the first naughty char of each type, but no further repeating chars.

2007-04-27 Reply Admin

Luke:
Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.

Regexps can do better when you are scanning for a long-enough multi-character string (the compiled RE should be able to take advantage of Boyer-Moore matching, though I suspect many RE compilers don't bother) but when you're just doing simple search for a single character, brute force works really well on modern hardware because it works with cache predictor units very nicely. This is the only way in which I can explain the fact that when I've tried out various ways of doing search/replace on strings in the past, the stupid brute-force approach worked well enough that it wasn't worth trying something more sophisticated and harder to maintain.

Now, if the original code had been doing the opposite (a complex RE to perform a simple one-character-for-one-character replacement) then we'd have had our WTF! for sure on the grounds of that being a stupidly expensive way to do a trivial operation. We'd have also known for sure that the original author was proving that (s)he could write Perl in any language.

2007-04-27 Reply Admin

I think the wtf here is that the double quote is escaped although it shouldn't be.

2007-04-27 Reply Admin

Without context, I don't see the wtf. It's not in javascript and the comment clearly states that the function removes "some" special characters. Never claimed to remove them all.

2007-04-27 Reply Admin

travisowens:
For those who couldn't figure it out (in order of WTF-ness):

He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.

He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)

He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.

If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.

That is not JavaScript
Since we don't know the context, we don't know which characters can do what damage
See 2
Who says HTML codes are harmless our should be allowed

Until we know what the return value should be used for and where the input value comes from, I see nothing wrong with that code.

2007-04-27 Reply Admin

travisowens: QFT

on another note: we're fond of doing all sorts of bizarre character replacements/sanitization here. We really like: ' to become ` And we can't handle , because we do naive CSV parsing And we translate = into ~ in some places because we're passing x=y type strings as values in GET or POST contexts z=x~y

It makes me giggle every time one of those comes back to bite us in the ass. I especially liked when we just removed ' from people's names, changing O'Reilly into Oreilly. I'm sure they loved that.

Eliminating Shady Characters

Leave a comment on “Eliminating Shady Characters”