• Hob (unregistered)

    Me no want neither.

    PS. No funny comments to read yet...

  • siclude (unregistered)

    wow. Sad thing is that I have seen worse. But... wow

  • Berend (unregistered)

    OMG - Where can I hire the boy-genius that came up with this? :)

  • (cs)

    Wouldn't it be great if it turned out that 'replace' and 'REPLACE' were two completely different functions? ;)

  • (cs)

    Wow, I'm a Complete Noob at SQL coding, and I've could of come up with something better then that, i wonder, is this PHP?

  • F (unregistered)

    Maybe some art of SQL obfuscation?

  • pnh (unregistered)

    i'm writing sql right now @ work and this makes me want to quit and go skydiving with no parachute.

    greets from Argentina, the bad-english country!

  • (cs) in reply to jrwr00
    jrwr00:
    Wow, I'm a Complete Noob at SQL coding, and I've could of come up with something better then that, i wonder, is this PHP?

    No, this is just SQL. SQL has functions too.

  • Suraj (unregistered)

    Well AFAIK there is no better way of doing character substitution in SQL. Unless you create a table & perform lookup from that table using a stored procedure.

  • diaphanein (unregistered) in reply to Suraj
    Suraj:
    Well AFAIK there is no better way of doing character substitution in SQL. Unless you create a table & perform lookup from that table using a stored procedure.

    I agree. Thing I cannot figure out is that this person seems to be replacing all control characters between 1 and 31, inclusive, w/ nulls. My guess is that somewhere, in either the translate or filter functions, these nulls are removed. I understand the desire to eliminate the control chars...but to do it in sql? What a pain in the arse.

  • serge (unregistered)

    I'm guessing this is Oracle, which only introduced regular expressions in 10g - so granted, in earlier versions this sort of thing can be annoying - but then such cleanup would normally be done in an input filter, not as an in-table update. At least use a UDF...

  • mav (unregistered)

    Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".

  • (cs) in reply to serge
    serge:
    I'm guessing this is Oracle, which only introduced regular expressions in 10g - so granted, in earlier versions this sort of thing can be annoying - but then such cleanup would normally be done in an input filter, not as an in-table update. At least use a UDF...

    Similar gotchas exist in MySQL.

  • (cs)

    If this is Oracle (kinda looks like), "translate" is a build-in function. Using a UDF would have been by far more beautiful, but it comes with a small performance penalty.

  • Marcin (unregistered)

    Can oracle not call out to c functions? If so, wouldn't character replacement be done more efficiently and comprehensibly there?

  • truthiness (unregistered) in reply to mav
    mav:
    Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".
    Replaces the first 31 chars ('\0', bell, tab, etc. The unprintables) with NULL.
  • (cs) in reply to mav

    This can't be SQL Server .. replacing a single character in a string (i.e., varchar) with a NULL results in NULL for the entire string:

    select replace('jeff', 'j',null)

    returns


    NULL

    (1 row(s) affected)

    As expected .... I mean, why would you think that you could replace a single character with NULL ? Makes no sense. If the goal is to take a string with ANY of these characters ANYWHERE and simply return NULL for the entire string, I guess it would work "properly", but I doubt that's it ...

  • maht (unregistered) in reply to Suraj

    get a better database

    REPLACE isn't an SQL95 standard call either

    postgresl has

    translate(string text, from text, to text) Any character in string that matches a character in the from set is replaced by the corresponding character in the to set

    e.g. translate('12345', '14', 'ax') a23x5

    so we have to translate any of \1-\31 with sentinel (I've chosen \1) and then replace that with our desired value (NULL for some reason)

    set name=replace(translate(new_name, '\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31', '\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1'), '\1', NULL);

    but this will output a NULL string if any of \1-\31 are encountered

  • thickasabrick (unregistered)

    Maybe that code was written in a contest to see who could call replace() the most.

  • (cs) in reply to Jeff S
    Jeff S:
    This can't be SQL Server .. replacing a single character in a string (i.e., varchar) with a NULL results in NULL for the entire string:
    I don't actually know which RDBMS was supposed to run this statement, but I'd venture the guess that REPLACE(text, substring, NULL) effectively strips all occurrences of the search string from the searched string on that database system.

    PostgreSQL would return NULL for that input just like SQL Server.

  • Ooter (unregistered) in reply to Jeff S

    I'd guess this is Oracle, too. Replacing a character with null will effectively delete that specific character from the string:

    select replace('abcdef','a',null) from dual;

    REPLA

    bcdef

    1 row selected.

  • TheUnknownCoder (unregistered)

    There's different ways to do that, but the most elegant solution would be something like:

    CREATE FUNCTION dbo.RemoveBadChars(@InputString VARCHAR(8000) = NULL)
     RETURNS @OutputString VARCHAR(8000) AS
     BEGIN
     /*
     You can either walk through each character and eliminate the ones
     not needed or do the nested replaces. You could also build your own
     extended procedures that uses some compiled DLL, but that could be
     an overkill depending on how this code is used.
     */
    
     SET @OutputString = REPLACE(REPLACE........**INSERT SCARY CODE HERE**.......))
     RETURN
    END
    

    Then, you can just call that ONE function for ALL columns. And since the functions are compiled optimized, the code will also run faster.

  • Anon (unregistered)

    This is when you need to write a (Perl/Python/PHP/C/whatever) program that writes your program. It's the only thing that will keep you out of the padded office if you have to, dog forbid, change it.

  • Randyd (unregistered)

    I think the function idea would be best - something along the lines of a series of replaces, each on a different line.

    however, the original writer may have been trying to avoid the need for create function or procedure privileges - so just jamming it all in one script leaves you with less options.

    if it is Oracle (i suspect so) then at least the script could be written as a PLSQL block - and it could look a little more like traditional code

    captcha = craptastic

  • (cs) in reply to maht
    maht:
    get a better database

    REPLACE isn't an SQL95 standard call either

    postgresl has

    translate(string text, from text, to text) Any character in string that matches a character in the from set is replaced by the corresponding character in the to set

    e.g. translate('12345', '14', 'ax') a23x5

    so we have to translate any of \1-\31 with sentinel (I've chosen \1) and then replace that with our desired value (NULL for some reason)

    set name=replace(translate(new_name, '\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31', '\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1'), '\1', NULL);

    but this will output a NULL string if any of \1-\31 are encountered

    In Oracle, the "translate" function does exactly the same; unfortunately, escaping special characters like '\1' doesn't work.

  • steve (unregistered)

    and god invented for loops for a reason...

  • (cs) in reply to Hob

    I don't see why anyone would design a table that had a record that would need to be updated like this. This code is just the tip of the WTF.

    Designing a database, you should have fields that change often, and fields that change rarely if at all. Records that change often should be foreign keyed to fields that change seldom, and when they need to be changed, they should be REPLACED, not abused with some crazy ass serial update stored procedure.

    Temporary items like semaphore implementations, should be deleted, or the entries should be moved to a log table or something. They should never be cleaned.

    If you do need to clean up your tables, it should be done by dates, which is easy as pie.

    This sort of thing is almost always the result of poor table design. It's just absurd.

  • DarthGeek (unregistered)

    In SQL Server:

    DECLARE @BadStrings VARCHAR(50) SET @BadStrings = 'Hello' + CHAR(5) + ' There'

    SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')


    That shoudl about do it.

  • Gnome (unregistered)

    I was going to put something witty here, however my captcha is "giggity" and that seems to sum it up pretty well.

  • Oraclie (unregistered)

    I dunno about you, but I would use a trigger that calls a cleanup stored procedure for each field... :new.name=fix(:new.name) ....etc???

  • RobertJohnK (unregistered)

    Data cleanup scripts should only be used once, and then only after every bit of data entry has been fixed to not insert undesired data in the first place.

  • Ross (unregistered)

    You people are morons. This is the only way to perform the required operation in sql without sacrificing performance. Not all programming is rainbows and sunshine. There will come days with the code gets ugly in oreder to perform. Get a grip, and let this one go.

  • (cs) in reply to DarthGeek
    DarthGeek:
    In SQL Server:

    DECLARE @BadStrings VARCHAR(50) SET @BadStrings = 'Hello' + CHAR(5) + ' There'

    SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')


    That shoudl about do it.

    It should about do it if you only want to replace only the very first "bad" character, and if you are defining bad characters as including punctuation and other things that you probably would want to keep.

  • JL (unregistered) in reply to Anon
    Anon:
    This is when you need to write a (Perl/Python/PHP/C/whatever) program that writes your program. It's the only thing that will keep you out of the padded office if you have to, dog forbid, change it.
    No kidding. For a moment I thought that this code was generated in such a way, but then I saw the random capitalization. I guess you'd want something like:
    s = columnName
    for i in range(1, 32):
        s = 'replace(%s,CHR(%d),NULL)' % (s, i)
    
  • (cs)

    In Oracle, replace('foo', null) is just a stupid way of writing replace('foo', ''), because in Oracle NULL and '' are the same thing. (Which is a WTF for another day...)

    There are (at least) three WTFs here:

    (1) They regularly have to run a script to remove this sort of thing from a well-respected data source (can't they just fix whatever keeps putting them in there, and run it once on the already-entered data?)

    (2) The writer of this fragment apparently didn't know how to write functions in SQL, though someone must have written "filter" and "translate". Even if this horrible expresssion were justified once, there's no reason to write it out once for each of the 32 columns.

    (3) They aren't using regular expressions, or built-in translate function correctly...they could at least do

    translate('x' || chr(1) || chr(2) || chr(3) || ... chr(31), 'x')
    which would strip the control characters without all the nested calls. (I guess their translate is something else...IIRC, Oracle would allow a user-defined translate/1 (one-argument translate) to co-exist with the builtin translate/3 (three-argument translate).

  • deroby (unregistered)

    Mind that this is clean-up code : write once, execute once!

    Not quite T-SQL I think, so I can't be sure on what other 'better' functions might be available; but apart from using a UDF (that holds more or less the same code, be it better 'hidden' and most probably a lot slower) I don't think there is much improvement to be made on this.

    From personal experience, more often than not you're given 'full' access (SUID) to a set of tables, but find yourself totally locked away from making your own objects (eg. functions) when doing ETL stuff, so the guy might have had to do what was available and did so nicely. Although it might not be the nicest thing to look at, it's correct and it's fast! Where's the WTF ?

    I'm pretty sure the author in question wrote a little script to generate given code (32 x 32 is A LOT of replaces =), if not, well hmm, THAT might be a WTF indeed =)

    PS: translate is a 'standard' SQL function according to "SQL in a nutshell", not quite sure I understand 'what it can look like' ?!

    Translate : Converts a string from one character set to another.

  • Gsquared (unregistered) in reply to DarthGeek
    DarthGeek:
    In SQL Server:

    DECLARE @BadStrings VARCHAR(50) SET @BadStrings = 'Hello' + CHAR(5) + ' There'

    SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')


    That shoudl about do it.

    That will eliminate the first instance of a bad character, but not more.

    For example:

    DECLARE @BadStrings VARCHAR(50) SET @BadStrings = 'Hello' + CHAR(5) + char(2)+' There'

    SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')

    (I added a second, different, undesired character)

    leaves a char(2) in the result.

    A UDF with a loop in it like:

    while patindex(@string, '%[-]%', @string) > 0 select @string = replace(@string, substring(@string, patindex(@string, '%[-]%', @string), 1), '')

    Will clean up the whole string. Performance isn't too bad, and it's much easier to maintain than 30+ nested replace() functions.

    (I hope the characters in my example show. I copied-and-pasted them into the editor here.)

  • (cs)

    deroby, you're wrong on all points. It sounds like they're running this regularly, it's not a question of access (someone of similar skill must have written filter and translate), and they're not using the built-in translate function...it's overloaded with their own. The performance penalty for user-defined functions obviously is not a concern if they're calling two of them and nesting 32 calls to a built-in (31 of which are unnecessary). They must not have written a script to generate this, since the case is not consistent.

  • Spencer (unregistered)

    And what if the feed is actually sending ISO-8859-1 or UTF-8? Then what? I would hope there's a better solution than enumerating all the "missing" characters in the Unicode space!

    Oracle supports in-DB Java since at least Oracle 8.

  • wpgDBA (unregistered) in reply to mav
    mav:
    Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".

    I lost interest a lot sooner, but yes, I can tell you what it does. It replaces in the "name" column any character with an ASCII value of 1 to 31 with a null value. In other words, it strips out most "control" characters.

    Of course, it COULD be written like this:

    variable SRCH_STR VARCHAR2(40)

    BEGIN :SRCH_STR := chr(1)||chr(2)||chr(3)||chr(4)||chr(5) ; :SRCHSTR := :SRCHSTR || chr(6)||chr(7)||chr(8)||chr(9)||chr(10) ; :SRCHSTR := :SRCHSTR || chr(11)||chr(12)||chr(13)||chr(14)||chr(15) ; :SRCHSTR := :SRCHSTR || chr(16)||chr(17)||chr(18)||chr(19)||chr(20) ; :SRCHSTR := :SRCHSTR || chr(21)||chr(22)||chr(23)||chr(24)||chr(25) ; :SRCHSTR := :SRCHSTR || chr(26)||chr(27)||chr(28)||chr(29)||chr(30)||chr(31) ; END; /

    UPDATE OWNER_USER.all_candidates SET name = filter(translate( dbms_standard.translate( name, 'a'||:SRCH_STR, 'a' ) )), ...

    This perversion probably occurred because somebody DID write their own TRANSLATE() function (therefore my "dbms_standard." junk).

    But I don't even want to go there ...

  • Steve (unregistered)

    Probably written by a DBA with a obsessive compulsive disorder

  • treypole (unregistered)

    The biggest WTF here is that it is being done in the database. Doesn't matter what RDBMS it is. This sort of string manipulation belongs in a scrubbing application. It's like building HTML in the database. Why?

  • (cs)

    The main problem is that it is done at the wrong level, it should of been implemented as a stored function, so that the 32 other columns could call it. Or it should of been part of the applications job to validate the columns. Also it could of been implemented as a trigger.

    The other problem is the way it was done. It could of been done like this (implemented using PostgreSQL, Note I've only known about Stored Procedures for a week, so if there is any little errors, blame it on lack of experience with them):

    CREATE OR REPLACE FUNCTION cleanUpCandidateColum(TEXT columnData)
          RETURNS TEXT
    BEGIN
       RETURN regexp_replace(columnData,'[\x1-\x1F]+', '', 'g');
    END;
    
  • Been there, done that (unregistered)

    I wouldn't be surprised if this code was actually generated by a program, maybe something which translates SQL queries from one database to another...

  • ChrisH (unregistered)

    I'm not going to read the comments.

    I just want to say.. "RegEx primer"

    PS this place has the funniest CAPTCHA, thanks :)

  • Da' Man (unregistered)

    I was working for a very large US-company (which actually should know better), and we did all the system integration in VBScript (!) - where any kind of necessary text encoding cleanup was done like this:

    out = replace(out, "ä", "ae") out = replace(out, "Ä", "Ae") ... out = replace(out, "é", "e") ...

    etc., etc.. hundreds of lines like this. This didn't even take into account that the replacement Ä -> Ae only works with some languages.

    Oh yes, and the target encoding was actually UTF-8...

    I quit there after half a year.

    DM

    Captcha: Java

  • APC (unregistered)

    I will admit to having written code which did something like this, for a routine which generated XML files from Oracle tables. You see, ASCII 0-8, 11-12, 14-31 are not valid XML characters, so I had to strip them out from the data. I just used a single REPLACE call which was nowhere near as ugly as this example.

    Cheers, APC

  • deroby (unregistered) in reply to slamb

    slamb: not going to start a YesNo war here. I simply wanted to point out that sometimes there are situations where you need 'do' with the tools you've been offered. Nested replaces are not that terrible IMHO. Sure it can be done nicer/faster/better/whatever, but without proper context I consider it a bit harsh on the author to call this a proper WTF.

    Makes me think of situations where you're sent to some company with the mission to convert some mapping from the "mainframe" to "the application server". You're fully briefed and prepared on the functional part, you know plenty of the app-servers architecture and best of all, you get the entire week to 'massage' a measly couple of ten-thousand lines into the new format, easy! Off course the local IT guys will love to help you out ("damn external overpaid consultant I don't have time for"), so they give you terminal access to a 'well-respected data source' (not familiar with AS400 ? Don't worry, it's self-explanatory). After 2 days lost you manage to write the files to disk and FTP them to your server where you then get a 'just-a-bit-more-than-public' account to manipulate and load the data again, only to find they're full of strange characters after you BCP the file in. So you start looking for problems and fix them one by one, hoping it will all be done before the end of the week because you don't want to miss your flight back home. Sometimes getting it working is a matter of economics (read: within the allotted time-frame and budget), not proper design. Sad but true. Does that make it a WTF, imho not, YMMV.

    FWIW :

    • Copy pasting this only takes only 5 iterations, casing seems to contradict scripting, but heck, I HOPE the other columns were cased identically ?!
    • The Translate() and Filter() functions make my 'not enough access' a moot point indeed, although it might be that he only added those afterwards and didn't want to break what already worked.
    • Although NOW I'd rather "loose" my first days fighting the local powers that be in order to get the proper tools for the task (eg. DTS or alike for a job like this), I've run into 'described' situations myself while I was a 'newbie' and 'getting it done' was what counted!
    • Although it's not said explicitly that this script runs on a regular basis, if they can than this is proof that he made it 'rerunnable', much better than those "manuscripts" left behind by "some experts" that explain in 18 steps using 15 tools how to redo the reload of the data
    • ETL can be hell.

    As a general note on this web-site : get real guys, there's way too much elitists (sp?) here. Sure it's cool to write "... I quit the week after that ...", but that doesn't make the problem go away. If everyone would react like that its no surprise that people get replaced by technical "know-nothings" with total lack of "coding-style" and maybe a funny accent, because at least THAT guy got something working the way "the boss" had requested. (sort off). Although I like to think the opposite too sometimes, coding is NOT a call, it's a JOB.

  • times1 (unregistered)

    So this is what happens when you have bad karma: you come back as a SQL programmer writing crap like this.

    And "repeated nearly verbatim for the 32 other columns in the table". That really is icing on the WTF cake.

  • anony-mouse (unregistered) in reply to Ross
    Ross:
    You people are morons. This is the only way to perform the required operation in sql without sacrificing performance. Not all programming is rainbows and sunshine. There will come days with the code gets ugly in oreder to perform. Get a grip, and let this one go.

    mmm, I love the way troll smells in the morning

Leave a comment on “The Data Cleanup”

Log In or post as a guest

Replying to comment #:

« Return to Article