• Welbog (cs)

    If only we could get ridoff those guys who like to say "first".

  • timb (unregistered)

    owned them good!

  • timb (unregistered)

    owned them good!

  • Mcoder (cs)

    Hey, what a lovely manual "quotation" system! But I'll still keep using prepared statements. At least they don't corrupt my data :)

  • ParkinT (cs)

    "ridoff", is that related to "rip-off"?

  • WIldpeaks (cs)

    So the wtf is the lame joke ?

  • Alex G. (unregistered)

    I don't really get it. Either that or I just look too much into it.

  • tiro (cs) in reply to Alex G.
    Alex G.:
    I don't really get it. Either that or I just look too much into it.

    It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

  • s (unregistered) in reply to tiro
    It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

    Except when you know the input shouldn't contain any of these characters. Usually the replacement is more heavyweight then. But there happen situations when the target system has magic characters which can't be quoted.

    This entry is not a WTF without context. At worst it's a "could be done better" practice.

  • Steve (unregistered) in reply to tiro
    tiro:
    . . . It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.
    It may be something of a tautology but the general case is not the specific case.

    Sometimes it makes perfect sense to just eat the character or replace it with something benign.

    It all depends on context.

  • loose (unregistered)

    Totally agree. The function does exactly what it says on the tin!

  • bonzombiekitty (cs) in reply to tiro
    tiro:
    Alex G.:
    I don't really get it. Either that or I just look too much into it.

    It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

    Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

    if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

    But I really can't complain, I've done things like that before because I was lazy.

    side note: the four backslashes thing always screws me up.

  • Nelle (unregistered)

    I don't get it ... Sorry, but I don't ... It may be funny or WTF, but it is far beyond my comprehension ... Perhaps when I pass a couple of Sun's Java Certification tests I may be able to laugh at this or to say WTF, but until then can someone please spell out what exactly is wrong with this peace of code ?

  • Licky Lindsay (cs) in reply to s
    s:
    Except when you know the input shouldn't contain any of these characters.

    In which case the input should be totally rejected, not "sanitized".

  • TheJasper (cs) in reply to Alex G.
    Alex G.:
    I don't really get it. Either that or I just look too much into it.

    Well, I think its just a little strange that the special characters gotten rid of are ' " and &. I would expect !@#$%^&*() or something to also be included. But then the submission says very little about what kinds of strings are being manipulated or for what purpose.

    I certainly don't get the 'clever' flavor text surrounding the code. It just feels like someone trying to be funny and falling flat on his face. It certainly doesn't explain the situation, unless the situation was that someone wanted to get rid of some special characters and the submitter thought that was bad.

  • s (unregistered) in reply to bonzombiekitty

    Regexp is almost always slower than straight-off character replace, even repeated 3 times. The original code is enormously more readable too.

    Sure if you can afford the comfort of discarding a piece of data and reporting an error instead, go ahead.

  • TheJasper (cs) in reply to bonzombiekitty
    bonzombiekitty:
    Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

    if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

    But I really can't complain, I've done things like that before because I was lazy.

    side note: the four backslashes thing always screws me up.

    I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.

  • Stephen Cochran (unregistered) in reply to Licky Lindsay
    In which case the input should be totally rejected, not "sanitized".
    There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.
  • loose (unregistered)

    It is all about context. The guy who wrote it may have had a boss that said "I'm sick of those goddam quotes and ampersands, get ridoff em!". In that case, the function is fine!

  • Markp (cs) in reply to Steve
    Steve:
    tiro:
    . . . It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.
    It may be something of a tautology but the general case is not the specific case.

    Sometimes it makes perfect sense to just eat the character or replace it with something benign.

    It all depends on context.

    Exactly, how do we know that the string isn't a ", ' or & delimited list of integers that the coder wants to generalize into a space-delimited list. That's obviously not likely, but man, we really can't comment on something as valid as this without any context.

  • Strider (unregistered) in reply to loose
    loose:
    Totally agree. The function does exactly what it says on the tin!

    Agreed This is the first time I have felt that this was not WTF worthy.
    I would write this code if I needed this functionality... although I'd never leave a typo in the comment, I'm too o.c. fourthat.

    CAPTCHA: poindexter - a person who is intelligent but socially inept; a nerd...hehesnort

  • shadowman (cs)

    Yeah, I'm not sure what's wrong here. Assuming the programmer had a good reason to remove single quotes, double quotes, and ampersands, that seems like a perfectly reasonable way to do it. Sure, I'm sure there are other ways, even more efficient ways to do it -- but the code itself really isn't bad or buggy or wtf'ey.

    So I'm guessing the problem lies in my assumption that there was a good reason to remove those things. But we really need more information here.

  • Independent (cs)

    I will really hate myself for what I'm about to say, but...

    The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".

  • Dan (unregistered) in reply to Licky Lindsay
    Licky Lindsay:
    s:
    Except when you know the input shouldn't contain any of these characters.

    In which case the input should be totally rejected, not "sanitized".

    What? It is this type of thought that keeps some systems from being extremely easy to use.

    A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???

  • .. (unregistered) in reply to TheJasper

    [quote user="TheJasper"] if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); } [/quote]

    I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.[/quote]

    A single-character search&replace is almost always faster than a regexp. A smart compiler in the above case will realize the first argument is a static string and statically compile the replacement code. A somewhat dumber compiler will assume the replacement expression -could- be a variable, and so will first pass the first argument through a regexp parser, then generate copy&replace code, then parse the value using said code, all during runtime.

    People think stuff like eval() or regexp replaces is harmless because it gets magically changed during original compilation. But what about regex=someExternalUserInput; somestring.replaceAll(regex,target); ?

  • Cassis (unregistered)
    Comment held for moderation.
  • .. (unregistered) in reply to Independent
    Independent:
    I will really hate myself for what I'm about to say, but...

    The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".

    If the WTF is a poor wording of a comment to a valid piece of code, then this story is Worse Than Failure.

  • Zen (unregistered)

    I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

  • akatherder (cs)

    If you're building a filename or directory in Windows you have to do "something" with the invalid characters. A space or underscore is probably your best bet.

    Although the code still doesn't make much sense.

  • anon (unregistered) in reply to Dan
    Dan:
    Licky Lindsay:
    s:
    Except when you know the input shouldn't contain any of these characters.

    In which case the input should be totally rejected, not "sanitized".

    What? It is this type of thought that keeps some systems from being extremely easy to use.

    A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???

    Context is always the key. Sometimes sanitizing by dropping/replacing characters is not really a valid option. In those situations one should reject the input (if we're not going to escape it) If this was a password value, we would not want silently to turn a password of &'""''&&" into a bunch of spaces.

    I agree this may not be best way to deal with special characters, or maybe not the most efficient. But without know what they are doing with these strings or why they decided these characters are bad, ti's hardly a wtf

  • Art (unregistered) in reply to bonzombiekitty
    bonzombiekitty:
    tiro:
    Alex G.:
    I don't really get it. Either that or I just look too much into it.

    It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

    Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

    if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

    But I really can't complain, I've done things like that before because I was lazy.

    side note: the four backslashes thing always screws me up.

    A test run on my PC doing the same operation 10000000 times showed that the character replace outperforms the regexp by a factor of 10 to 1.

    I would say the same about readability. The fact that you mistyped it in your post (missing a backslash) says something.

  • Thief^ (cs) in reply to Zen
    Zen:
    I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.
    I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.
  • s (unregistered) in reply to Zen

    "cash&items".match(/shit/)

  • BA (unregistered) in reply to Stephen Cochran
    Stephen Cochran:
    In which case the input should be totally rejected, not "sanitized".
    There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

    Agreed. As an example consider the many ways a phone number can be written down: 800 555 1234 800.555.1234 (800) 555-1234 800-555-1234 555-1234 555.1234

    And the internal format the application expects... 800/555-1234

    The answer in this particular case would be: Sanitize when possible, reject when not.

    And the real WTF is ragging on people for misspelling in comments when the meaning and intention are perfectly clear. Maybe the guy is ESL and isn't familiar with English idioms.

    For shame worsethanfailure.com, for shame.

  • s (unregistered) in reply to Thief^
    Thief^:
    Zen:
    I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.
    I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.

    I've used a system that truncated the password to 8 characters. I was quite annoyed when after a year of using the system just fine, typing my 12-letter password with special characters at the end I learned my password was really a common dictionary word, all-lowercase. Still beats the hell of a system which required the password to contain at least 2 digits, at least 5 lowercase characters, no other characters than [0-9a-z] and be between 8 and 10 characters long, then refuse anything else.

    A basic place for use of the OP code would be a simple search engine entry preprocessing. Remove the quotes after switching the "sentence mode" on, remove '&' as AND operator which is default, leave spaces because otherwise the remaining words would get concatenated.

  • Alan (unregistered)

    I remember back in the bad old days when I used vb4, access and csv. Quotes were the bane of my life and no matter the amount of times we tried to filter them out, some function somewhere would let em through and screw up the whole system.

    Og course we just told off the users - "How dare you use a quote in your item descriptions - its your fault the application lost all your data"

  • AssimilatedByBorg (cs) in reply to Stephen Cochran
    Stephen Cochran:
    In which case the input should be totally rejected, not "sanitized".
    There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.
    OTOH, what really annoys me is systems that "sanitize" input for no apparent reason. e.g., removes a hyphen from a hyphenated name, or lowercases all but the first letter in a name that has mixed capitalization.

    I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...

  • dkf (unregistered) in reply to WIldpeaks
    WIldpeaks:
    So the wtf is the lame joke ?
    Either that or Derrick Pallas's lame choice of article. (You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)
  • Luke (unregistered) in reply to TheJasper

    I wouldn't be surprised if the compiler could even optimise it to the same thing.

    I would be surprised. And impressed.

    I'm guessing that the direct character replace version (the one featured) is faster, but I'd be willing to bet that if the regex were precompiled then the regex would be faster by a factor of about 3.

    As though it mattered. This isn't the kind of site where we are supposed to be picking over people's code for being cycles slower! Let's see some bad code snippets. Give me exponential time solutions to log (or constant) time problems! :-) I can go to YouTube if I want to see people who can't spell.

    Luke

  • akatherder (cs)

    The Sparc workstations at my University had the same password weakness where it only looked at the first 8 characters. I proclaimed my password un-hackable to my roommate. When he wasn't looking I copied my password. Then I told him to watch. I pasted my password then proceeded to slam on keys for about 5 full seconds. Lo and behold it worked when I hit enter.

  • Luke (unregistered) in reply to dkf

    (You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)

    Nah. Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.

    Luke

  • dkf (unregistered) in reply to AssimilatedByBorg
    AssimilatedByBorg:
    I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...
    You are aware that capitalization rules are locale-specific? That means if you work with the same data in the USA, the Netherlands and Turkey, you run the risk of getting three different answers if you don't take care...
  • Uberbandit (unregistered) in reply to bonzombiekitty
    bonzombiekitty:
    tiro:
    Alex G.:
    I don't really get it. Either that or I just look too much into it.

    It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

    Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

    if(a_sValue != null){ return a_sValue.replaceAll("[\\'|\\"|&|]", " "); }

    But I really can't complain, I've done things like that before because I was lazy.

    side note: the four backslashes thing always screws me up.

    Why the last | ? I'm not very familiriazed with Java but Shouldn't it be: return a_sValue.replaceAll("[\\'|\\"|&]", " ");

    CAPTCHA: Again "Slashbot"? I filled my quota of creativity with you last time.

  • travisowens (cs)

    For those who couldn't figure it out (in order of WTF-ness):

    1. He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.

    2. He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)

    3. He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.

    4. If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

    PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.

  • travisowens (cs)

    Oh I forgot the Javascript syntax for RegEx.Replace. A MAJOR mistake he made is that global replace is not enabled, so he replaces only the first naughty char of each type, but no further repeating chars.

  • dkf (unregistered) in reply to Luke
    Luke:
    Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.
    Regexps can do better when you are scanning for a long-enough multi-character string (the compiled RE should be able to take advantage of Boyer-Moore matching, though I suspect many RE compilers don't bother) but when you're just doing simple search for a single character, brute force works really well on modern hardware because it works with cache predictor units very nicely. This is the only way in which I can explain the fact that when I've tried out various ways of doing search/replace on strings in the past, the stupid brute-force approach worked well enough that it wasn't worth trying something more sophisticated and harder to maintain.

    Now, if the original code had been doing the opposite (a complex RE to perform a simple one-character-for-one-character replacement) then we'd have had our WTF! for sure on the grounds of that being a stupidly expensive way to do a trivial operation. We'd have also known for sure that the original author was proving that (s)he could write Perl in any language.

  • xeen (unregistered)

    I think the wtf here is that the double quote is escaped although it shouldn't be.

  • Bart (unregistered)

    Without context, I don't see the wtf. It's not in javascript and the comment clearly states that the function removes "some" special characters. Never claimed to remove them all.

  • Art (unregistered) in reply to travisowens
    travisowens:
    For those who couldn't figure it out (in order of WTF-ness):
    1. He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.

    2. He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)

    3. He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.

    4. If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

    PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.

    1. That is not JavaScript
    2. Since we don't know the context, we don't know which characters can do what damage
    3. See 2
    4. Who says HTML codes are harmless our should be allowed

    Until we know what the return value should be used for and where the input value comes from, I see nothing wrong with that code.

  • Justin (unregistered)

    travisowens: QFT

    on another note: we're fond of doing all sorts of bizarre character replacements/sanitization here. We really like: ' to become ` And we can't handle , because we do naive CSV parsing And we translate = into ~ in some places because we're passing x=y type strings as values in GET or POST contexts z=x~y

    It makes me giggle every time one of those comes back to bite us in the ass. I especially liked when we just removed ' from people's names, changing O'Reilly into Oreilly. I'm sure they loved that.

Leave a comment on “Eliminating Shady Characters”

Log In or post as a guest

Replying to comment #:

« Return to Article