Comment On The Data Cleanup

Ralph B. submitted today's scary code snippet, found in a script used for "cleaning up" data from a widely used and well-respected data source for use in another widely used and well-respected data source. [expand full text]
« PrevPage 1 | Page 2Next »

Re: The Data Cleanup

2007-01-08 09:03 • by Hob (unregistered)
Me no want neither.

PS. No funny comments to read yet...

Re: The Data Cleanup

2007-01-08 09:04 • by siclude (unregistered)
wow. Sad thing is that I have seen worse. But... wow

Re: The Data Cleanup

2007-01-08 09:07 • by Berend (unregistered)
OMG - Where can I hire the boy-genius that came up with this? :)

Re: The Data Cleanup

2007-01-08 09:14 • by DaveK
Wouldn't it be great if it turned out that 'replace' and 'REPLACE' were two completely different functions? ;)

Re: The Data Cleanup

2007-01-08 09:15 • by jrwr00
Wow, I'm a Complete Noob at SQL coding, and I've could of come up with something better then that, i wonder, is this PHP?

Re: The Data Cleanup

2007-01-08 09:17 • by F (unregistered)
Maybe some art of SQL obfuscation?

Re: The Data Cleanup

2007-01-08 09:20 • by pnh (unregistered)
i'm writing sql right now @ work and this makes me want to quit and go skydiving with no parachute.

greets from Argentina, the bad-english country!

Re: The Data Cleanup

2007-01-08 09:21 • by The MAZZTer
110256 in reply to 110253
jrwr00:
Wow, I'm a Complete Noob at SQL coding, and I've could of come up with something better then that, i wonder, is this PHP?


No, this is just SQL. SQL has functions too.

Re: The Data Cleanup

2007-01-08 09:21 • by Suraj (unregistered)
Well AFAIK there is no better way of doing character substitution in SQL. Unless you create a table & perform lookup from that table using a stored procedure.

Re: The Data Cleanup

2007-01-08 09:28 • by diaphanein (unregistered)
110258 in reply to 110257
Suraj:
Well AFAIK there is no better way of doing character substitution in SQL. Unless you create a table & perform lookup from that table using a stored procedure.


I agree. Thing I cannot figure out is that this person seems to be replacing all control characters between 1 and 31, inclusive, w/ nulls. My guess is that somewhere, in either the translate or filter functions, these nulls are removed. I understand the desire to eliminate the control chars...but to do it in sql? What a pain in the arse.

Re: The Data Cleanup

2007-01-08 09:35 • by serge (unregistered)
I'm guessing this is Oracle, which only introduced regular expressions in 10g - so granted, in earlier versions this sort of thing can be annoying - but then such cleanup would normally be done in an input filter, not as an in-table update. At least use a UDF...

Re: The Data Cleanup

2007-01-08 09:37 • by mav (unregistered)
Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".

Re: The Data Cleanup

2007-01-08 09:39 • by mkb
110262 in reply to 110260
serge:
I'm guessing this is Oracle, which only introduced regular expressions in 10g - so granted, in earlier versions this sort of thing can be annoying - but then such cleanup would normally be done in an input filter, not as an in-table update. At least use a UDF...


Similar gotchas exist in MySQL.

Re: The Data Cleanup

2007-01-08 09:45 • by ammoQ
If this is Oracle (kinda looks like), "translate" is a build-in function. Using a UDF would have been by far more beautiful, but it comes with a small performance penalty.

Re: The Data Cleanup

2007-01-08 10:05 • by Marcin (unregistered)
Can oracle not call out to c functions? If so, wouldn't character replacement be done more efficiently and comprehensibly there?

Re: The Data Cleanup

2007-01-08 10:06 • by truthiness (unregistered)
110269 in reply to 110261
mav:
Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".

Replaces the first 31 chars ('\0', bell, tab, etc. The unprintables) with NULL.

Re: The Data Cleanup

2007-01-08 10:10 • by Jeff S
110272 in reply to 110261
This can't be SQL Server .. replacing a single character in a string (i.e., varchar) with a NULL results in NULL for the entire string:

select replace('jeff', 'j',null)

returns

---------------
NULL

(1 row(s) affected)

As expected .... I mean, why would you think that you could replace a single character with NULL ? Makes no sense. If the goal is to take a string with ANY of these characters ANYWHERE and simply return NULL for the entire string, I guess it would work "properly", but I doubt that's it ...

Re: The Data Cleanup

2007-01-08 10:29 • by maht (unregistered)
110274 in reply to 110257
get a better database

REPLACE isn't an SQL95 standard call either

postgresl has

translate(string text, from text, to text)
Any character in string that matches a character in the from set is replaced by the corresponding character in the to set

e.g. translate('12345', '14', 'ax') a23x5


so we have to translate any of \1-\31 with sentinel (I've chosen \1) and then replace that with our desired value (NULL for some reason)

set name=replace(translate(new_name, '\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31', '\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1'), '\1', NULL);

but this will output a NULL string if any of \1-\31 are encountered

Re: The Data Cleanup

2007-01-08 10:31 • by thickasabrick (unregistered)
Maybe that code was written in a contest to see who could call replace() the most.

Re: The Data Cleanup

2007-01-08 10:35 • by Michael Buschbeck
110276 in reply to 110272
Jeff S:
This can't be SQL Server .. replacing a single character in a string (i.e., varchar) with a NULL results in NULL for the entire string:

I don't actually know which RDBMS was supposed to run this statement, but I'd venture the guess that REPLACE(text, substring, NULL) effectively strips all occurrences of the search string from the searched string on that database system.

PostgreSQL would return NULL for that input just like SQL Server.

Re: The Data Cleanup

2007-01-08 10:39 • by Ooter (unregistered)
110277 in reply to 110272
I'd guess this is Oracle, too. Replacing a character with null will effectively delete that specific character from the string:

select replace('abcdef','a',null) from dual;

REPLA
-----
bcdef

1 row selected.

Re: The Data Cleanup

2007-01-08 10:55 • by TheUnknownCoder (unregistered)
There's different ways to do that, but the most elegant solution would be something like:


CREATE FUNCTION dbo.RemoveBadChars(@InputString VARCHAR(8000) = NULL)
RETURNS @OutputString VARCHAR(8000) AS
BEGIN
/*
You can either walk through each character and eliminate the ones
not needed or do the nested replaces. You could also build your own
extended procedures that uses some compiled DLL, but that could be
an overkill depending on how this code is used.
*/

SET @OutputString = REPLACE(REPLACE........**INSERT SCARY CODE HERE**.......))
RETURN
END


Then, you can just call that ONE function for ALL columns. And since the functions are compiled optimized, the code will also run faster.

Re: The Data Cleanup

2007-01-08 11:05 • by Anon (unregistered)
This is when you need to write a (Perl/Python/PHP/C/whatever) program that writes your program. It's the only thing that will keep you out of the padded office if you have to, dog forbid, change it.

Re: The Data Cleanup

2007-01-08 11:12 • by Randyd (unregistered)
I think the function idea would be best - something along the lines of a series of replaces, each on a different line.

however, the original writer may have been trying to avoid the need for create function or procedure privileges - so just jamming it all in one script leaves you with less options.

if it is Oracle (i suspect so) then at least the script could be written as a PLSQL block - and it could look a little more like traditional code

captcha = craptastic

Re: The Data Cleanup

2007-01-08 11:56 • by ammoQ
110288 in reply to 110274
maht:
get a better database

REPLACE isn't an SQL95 standard call either

postgresl has

translate(string text, from text, to text)
Any character in string that matches a character in the from set is replaced by the corresponding character in the to set

e.g. translate('12345', '14', 'ax') a23x5


so we have to translate any of \1-\31 with sentinel (I've chosen \1) and then replace that with our desired value (NULL for some reason)

set name=replace(translate(new_name, '\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31', '\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1'), '\1', NULL);

but this will output a NULL string if any of \1-\31 are encountered



In Oracle, the "translate" function does exactly the same; unfortunately, escaping special characters like '\1' doesn't work.

Re: The Data Cleanup

2007-01-08 12:15 • by steve (unregistered)
and god invented for loops for a reason...

Re: The Data Cleanup

2007-01-08 12:19 • by Satanicpuppy
110292 in reply to 110248
I don't see why anyone would design a table that had a record that would need to be updated like this. This code is just the tip of the WTF.

Designing a database, you should have fields that change often, and fields that change rarely if at all. Records that change often should be foreign keyed to fields that change seldom, and when they need to be changed, they should be REPLACED, not abused with some crazy ass serial update stored procedure.

Temporary items like semaphore implementations, should be deleted, or the entries should be moved to a log table or something. They should never be *cleaned*.

If you do need to clean up your tables, it should be done by dates, which is easy as pie.

This sort of thing is almost always the result of poor table design. It's just absurd.

Re: The Data Cleanup

2007-01-08 12:31 • by DarthGeek (unregistered)
In SQL Server:

DECLARE @BadStrings VARCHAR(50)
SET @BadStrings = 'Hello' + CHAR(5) + ' There'

SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')

-----

That shoudl about do it.

Re: The Data Cleanup

2007-01-08 12:37 • by Gnome (unregistered)
I was going to put something witty here, however my captcha is "giggity" and that seems to sum it up pretty well.

Re: The Data Cleanup

2007-01-08 12:56 • by Oraclie (unregistered)
I dunno about you, but I would use a trigger that calls a cleanup stored procedure for each field... :new.name=fix(:new.name) ....etc???

Re: The Data Cleanup

2007-01-08 13:10 • by RobertJohnK (unregistered)
Data cleanup scripts should only be used once, and then only after every bit of data entry has been fixed to not insert undesired data in the first place.

Re: The Data Cleanup

2007-01-08 13:37 • by Ross (unregistered)
You people are morons. This is the only way to perform the required operation in sql without sacrificing performance. Not all programming is rainbows and sunshine. There will come days with the code gets ugly in oreder to perform. Get a grip, and let this one go.

Re: The Data Cleanup

2007-01-08 13:45 • by Jeff S
110318 in reply to 110293
DarthGeek:
In SQL Server:

DECLARE @BadStrings VARCHAR(50)
SET @BadStrings = 'Hello' + CHAR(5) + ' There'

SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')

-----

That shoudl about do it.


It should about do it if you only want to replace only the very first "bad" character, and if you are defining bad characters as including punctuation and other things that you probably would want to keep.

Re: The Data Cleanup

2007-01-08 14:20 • by JL (unregistered)
110331 in reply to 110279
Anon:
This is when you need to write a (Perl/Python/PHP/C/whatever) program that writes your program. It's the only thing that will keep you out of the padded office if you have to, dog forbid, change it.

No kidding. For a moment I thought that this code was generated in such a way, but then I saw the random capitalization. I guess you'd want something like:

s = columnName
for i in range(1, 32):
s = 'replace(%s,CHR(%d),NULL)' % (s, i)

Re: The Data Cleanup

2007-01-08 14:25 • by slamb
In Oracle, replace('foo', null) is just a stupid way of writing replace('foo', ''), because in Oracle NULL and '' are the same thing. (Which is a WTF for another day...)

There are (at least) three WTFs here:

(1) They regularly have to run a script to remove this sort of thing from a well-respected data source (can't they just fix whatever keeps putting them in there, and run it once on the already-entered data?)

(2) The writer of this fragment apparently didn't know how to write functions in SQL, though someone must have written "filter" and "translate". Even if this horrible expresssion were justified once, there's no reason to write it out once for each of the 32 columns.

(3) They aren't using regular expressions, or built-in translate function correctly...they could at least do
translate('x' || chr(1) || chr(2) || chr(3) || ... chr(31), 'x')
which would strip the control characters without all the nested calls. (I guess their translate is something else...IIRC, Oracle would allow a user-defined translate/1 (one-argument translate) to co-exist with the builtin translate/3 (three-argument translate).

Re: The Data Cleanup (where's the WTF ?)

2007-01-08 14:29 • by deroby (unregistered)
Mind that this is clean-up code : write once, execute once!

Not quite T-SQL I think, so I can't be sure on what other 'better' functions might be available; but apart from using a UDF (that holds more or less the same code, be it better 'hidden' and most probably a lot slower) I don't think there is much improvement to be made on this.

From personal experience, more often than not you're given 'full' access (SUID) to a set of tables, but find yourself totally locked away from making your own objects (eg. functions) when doing ETL stuff, so the guy might have had to do what was available and did so nicely. Although it might not be the nicest thing to look at, it's correct and it's fast! Where's the WTF ?

I'm pretty sure the author in question wrote a little script to generate given code (32 x 32 is A LOT of replaces =), if not, well hmm, THAT might be a WTF indeed =)

PS: translate is a 'standard' SQL function according to "SQL in a nutshell", not quite sure I understand 'what it can look like' ?!

Translate : Converts a string from one character set to another.

Re: The Data Cleanup

2007-01-08 14:36 • by Gsquared (unregistered)
110336 in reply to 110293
DarthGeek:
In SQL Server:

DECLARE @BadStrings VARCHAR(50)
SET @BadStrings = 'Hello' + CHAR(5) + ' There'

SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')

-----

That shoudl about do it.


That will eliminate the first instance of a bad character, but not more.

For example:

DECLARE @BadStrings VARCHAR(50)
SET @BadStrings = 'Hello' + CHAR(5) + char(2)+' There'

SELECT REPLACE(@BadStrings, SUBSTRING(@BadStrings, PATINDEX('%[^a-zA-Z0-9 '''''']%', @BadStrings), 1), '')

(I added a second, different, undesired character)

leaves a char(2) in the result.

A UDF with a loop in it like:

while patindex(@string, '%[-]%', @string) > 0
select @string = replace(@string, substring(@string, patindex(@string, '%[-]%', @string), 1), '')

Will clean up the whole string. Performance isn't too bad, and it's much easier to maintain than 30+ nested replace() functions.

(I hope the characters in my example show. I copied-and-pasted them into the editor here.)

Re: The Data Cleanup

2007-01-08 14:37 • by slamb
deroby, you're wrong on all points. It sounds like they're running this regularly, it's not a question of access (someone of similar skill must have written filter and translate), and they're not using the built-in translate function...it's overloaded with their own. The performance penalty for user-defined functions obviously is not a concern if they're calling two of them and nesting 32 calls to a built-in (31 of which are unnecessary). They must not have written a script to generate this, since the case is not consistent.

Re: The Data Cleanup

2007-01-08 16:01 • by Spencer (unregistered)
And what if the feed is actually sending ISO-8859-1 or UTF-8? Then what? I would hope there's a better solution than enumerating all the "missing" characters in the Unicode space!

Oracle supports in-DB Java since at least Oracle 8.

Re: The Data Cleanup

2007-01-08 16:44 • by wpgDBA (unregistered)
110368 in reply to 110261
mav:
Good night... Can anyone tell me what exactly that does? I lost interest somewhere around the 15th "Replace".


I lost interest a lot sooner, but yes, I can tell you what it does. It replaces in the "name" column any character with an ASCII value of 1 to 31 with a null value. In other words, it strips out most "control" characters.

Of course, it COULD be written like this:

variable SRCH_STR VARCHAR2(40)

BEGIN
:SRCH_STR := chr(1)||chr(2)||chr(3)||chr(4)||chr(5) ;
:SRCHSTR := :SRCHSTR || chr(6)||chr(7)||chr(8)||chr(9)||chr(10) ;
:SRCHSTR := :SRCHSTR ||
chr(11)||chr(12)||chr(13)||chr(14)||chr(15) ;
:SRCHSTR := :SRCHSTR || chr(16)||chr(17)||chr(18)||chr(19)||chr(20) ;
:SRCHSTR := :SRCHSTR ||
chr(21)||chr(22)||chr(23)||chr(24)||chr(25) ;
:SRCHSTR := :SRCHSTR || chr(26)||chr(27)||chr(28)||chr(29)||chr(30)||chr(31) ;
END;
/




UPDATE OWNER_USER.all_candidates
SET name = filter(translate(
dbms_standard.translate( name, 'a'||:SRCH_STR, 'a' )
)), ...


This perversion probably occurred because somebody DID write their own TRANSLATE() function (therefore my "dbms_standard." junk).

But I don't even want to go there ...

Re: The Data Cleanup

2007-01-08 18:19 • by Steve (unregistered)
Probably written by a DBA with a obsessive compulsive disorder

Re: The Data Cleanup

2007-01-08 19:43 • by treypole (unregistered)
The biggest WTF here is that it is being done in the database. Doesn't matter what RDBMS it is. This sort of string manipulation belongs in a scrubbing application.
It's like building HTML in the database. Why?

Re: The Data Cleanup

2007-01-08 21:12 • by brendan
The main problem is that it is done at the wrong level, it should of been implemented as a stored function, so that the 32 other columns could call it. Or it should of been part of the applications job to validate the columns. Also it could of been implemented as a trigger.

The other problem is the way it was done. It could of been done like this (implemented using PostgreSQL, Note I've only known about Stored Procedures for a week, so if there is any little errors, blame it on lack of experience with them):


CREATE OR REPLACE FUNCTION cleanUpCandidateColum(TEXT columnData)
RETURNS TEXT
BEGIN
RETURN regexp_replace(columnData,'[\x1-\x1F]+', '', 'g');
END;

Auto-generated code

2007-01-08 21:31 • by Been there, done that (unregistered)
I wouldn't be surprised if this code was actually generated by a program, maybe something which translates SQL queries from one database to another...

Re: The Data Cleanup

2007-01-09 06:45 • by ChrisH (unregistered)
I'm not going to read the comments.

I just want to say.. "RegEx primer"

PS this place has the funniest CAPTCHA, thanks :)


Re: The Data Cleanup

2007-01-09 07:22 • by Da' Man (unregistered)
I was working for a very large US-company (which actually should know better), and we did all the system integration in VBScript (!) - where any kind of necessary text encoding cleanup was done like this:

out = replace(out, "ä", "ae")
out = replace(out, "Ä", "Ae")
...
out = replace(out, "é", "e")
...

etc., etc.. hundreds of lines like this. This didn't even take into account that the replacement Ä -> Ae only works with some languages.

Oh yes, and the target encoding was actually UTF-8...

I quit there after half a year.

DM

Captcha: Java

Re: The Data Cleanup

2007-01-09 08:23 • by APC (unregistered)
I will admit to having written code which did something like this, for a routine which generated XML files from Oracle tables. You see, ASCII 0-8, 11-12, 14-31 are not valid XML characters, so I had to strip them out from the data. I just used a single REPLACE call which was nowhere near as ugly as this example.

Cheers, APC

Re: The Data Cleanup

2007-01-09 10:37 • by deroby (unregistered)
110451 in reply to 110337
slamb: not going to start a YesNo war here. I simply wanted to point out that sometimes there are situations where you need 'do' with the tools you've been offered. Nested replaces are not that terrible IMHO. Sure it can be done nicer/faster/better/whatever, but without proper context I consider it a bit harsh on the author to call this a proper WTF.

Makes me think of situations where you're sent to some company with the mission to convert some mapping from the "mainframe" to "the application server". You're fully briefed and prepared on the functional part, you know plenty of the app-servers architecture and best of all, you get the *entire* week to 'massage' a measly couple of ten-thousand lines into the new format, easy!
Off course the local IT guys will love to help you out ("damn external overpaid consultant I don't have time for"), so they give you terminal access to a 'well-respected data source' (not familiar with AS400 ? Don't worry, it's self-explanatory).
After 2 days lost you manage to write the files to disk and FTP them to your server where you then get a 'just-a-bit-more-than-public' account to manipulate and load the data again, only to find they're full of strange characters after you BCP the file in.
So you start looking for problems and fix them one by one, hoping it will all be done before the end of the week because you don't want to miss your flight back home.
Sometimes getting it working is a matter of economics (read: within the allotted time-frame and budget), not proper design. Sad but true. Does that make it a WTF, imho not, YMMV.

FWIW :
* Copy pasting this only takes only 5 iterations, casing seems to contradict scripting, but heck, I HOPE the other columns were cased identically ?!
* The Translate() and Filter() functions make my 'not enough access' a moot point indeed, although it might be that he only added those afterwards and didn't want to break what already worked.
* Although NOW I'd rather "loose" my first days fighting the local powers that be in order to get the proper tools for the task (eg. DTS or alike for a job like this), I've run into 'described' situations myself while I was a 'newbie' and 'getting it done' was what counted!
* Although it's not said explicitly that this script runs on a regular basis, if they can than this is proof that he made it 'rerunnable', much better than those "manuscripts" left behind by "some experts" that explain in 18 steps using 15 tools how to redo the reload of the data
* ETL can be hell.

As a general note on this web-site : get real guys, there's way too much elitists (sp?) here. Sure it's cool to write "... I quit the week after that ...", but that doesn't make the problem go away. If everyone would react like that its no surprise that people get replaced by technical "know-nothings" with total lack of "coding-style" and maybe a funny accent, because at least THAT guy got something working the way "the boss" had requested. (sort off).
Although I like to think the opposite too sometimes, coding is NOT a call, it's a JOB.

Re: The Data Cleanup

2007-01-09 11:00 • by times1 (unregistered)
So this is what happens when you have bad karma: you come back as a SQL programmer writing crap like this.

And "repeated nearly verbatim for the 32 other columns in the table". That really is icing on the WTF cake.

Re: The Data Cleanup

2007-01-09 12:54 • by anony-mouse (unregistered)
110496 in reply to 110314
Ross:
You people are morons. This is the only way to perform the required operation in sql without sacrificing performance. Not all programming is rainbows and sunshine. There will come days with the code gets ugly in oreder to perform. Get a grip, and let this one go.


mmm, I love the way troll smells in the morning
« PrevPage 1 | Page 2Next »

Add Comment