The Daily WTF: Curious Perversions in Information Technology

2007-04-10 Reply Admin

....

Harsh · 2007-04-10 Reply Admin

I used to deal with a pretty messed up database but not like this. (fist)

2007-04-10 Reply Admin

Yet another racist article... Just what oh what is this site coming to?!

evanm · 2007-04-10 Reply Admin

Alright, WTF?!?!?!?!? Please, more context. Without any, this just looks like some superfluous debug code that was never deleted, but is no longer used.

KattMan · 2007-04-10 Reply Admin

I agree this makes no sense. Setting Identity Insert ON only has an affect for INSERT statement. Between setting it on and off there is only a select statement. How is this relevant? Additionally, what does the identity column have to do with unique usernames? Is the username field used as the identity?

What, pray tell, is this article about?

2007-04-10 Reply Admin

I think the whole site should be revamped to say what is wrong with the code and what is the right way to do it.

It could be educational for a lot of people instead of just snarky. Although snarky is good. Maybe just put a link in of the best way to do something. Or maybe all the wiseguys can argue about the best way to do it.

kswanton · 2007-04-10 Reply Admin

that guy:
Yet another racist article... Just what oh what is this site coming to?!

Huh?

2007-04-10 Reply Admin

kswanton:
that guy:
Yet another racist article... Just what oh what is this site coming to?!

Huh?

see previous article comments =)

2007-04-10 Reply Admin

@that guy: racist? huh? I don't get it.

@OP: I'm not a database guy. Why can't all the nulls and blanks and duplicates be corrected in the database somehow? (ie: delete bogus ones, send notifications to users to change duplicate usernames by xxxx date, and after that, fix the rest of the usernames and let those affected get it resolved on their own schedule.)

CapitalT · 2007-04-10 Reply Admin

foreach( user in worsethanfailure ) { yell( user, "DON'T FEED THE TROLL" ); }

2007-04-10 Reply Admin

Scraping the barrels of imagination, I see.

captcha: atari

vt_mruhlin · 2007-04-10 Reply Admin

newbie:
I think the whole site should be revamped to say *what is wrong with the code and what is the right way to do it* .... Or maybe all the wiseguys can argue about the best way to do it.

That seems to be the primary purpose of the comments section anyhow. It generally turns into a collaborative discussion of how we could make that crap better.

In some ways I'm glad to see others are as baffled as I am on this one. On the other hand, I was kind of hoping somebody would figure it out and I'd learn something new.

snoofle · 2007-04-10 Reply Admin

ALTER PROCEDURE [dbo].[ZZZZZZZZJunk] AS
  SET IDENTITY_INSERT ARTrans1  ON
  select 'Cross your fingers...'
  SET IDENTITY_INSERT ARTrans1  OFF

.

Ok, just guessing here...

'select ...' was probably intended to be 'insert...', which would make sense given the statement that the db has no uniqueness constraints (at least for username). Otherwise, it is truly a wtf as insert-flags have no effect on select's.

Then again, why would you have identity checks enabled when no unique constraints exist? I mean, why not make the whole thing a bunch of varchar(255)'s?

Sort-of related aside: I've just been given the task of figuring how much storage to allocate for a db. The data will be coming to us in an xml doc. The schema for the xml has every single field listed as unlimited length strings in repeatable groups of 0..infinity repetitions. I told the guy he'll need somewhere between 0..infinity GB of storage.

Admittedly, I'm not a DB guy, but come-on!!!

sigh

snoofle · 2007-04-10 Reply Admin

vt_mruhlin:
newbie:
I think the whole site should be revamped to say *what is wrong with the code and what is the right way to do it* .... Or maybe all the wiseguys can argue about the best way to do it.

That seems to be the primary purpose of the comments section anyhow. It generally turns into a collaborative discussion of how we could make that crap better.

In some ways I'm glad to see others are as baffled as I am on this one. On the other hand, I was kind of hoping somebody would figure it out and I'd learn something new.

So 'crap' is available in different levels of quality? This will inevitably lead to conversations such as...

I am not satisfied with this crap; I'd like better crap; top quality crap!

vt_mruhlin · 2007-04-10 Reply Admin

snoofle:
vt_mruhlin:
newbie:
I think the whole site should be revamped to say *what is wrong with the code and what is the right way to do it* .... Or maybe all the wiseguys can argue about the best way to do it.

That seems to be the primary purpose of the comments section anyhow. It generally turns into a collaborative discussion of how we could make that crap better.

In some ways I'm glad to see others are as baffled as I am on this one. On the other hand, I was kind of hoping somebody would figure it out and I'd learn something new.
So 'crap' is available in different levels of quality? This will inevitably lead to conversations such as...

I am not satisfied with this crap; I'd like better crap; top quality crap!

If you worked for my company, that wouldn't be a joke...

2007-04-10 Reply Admin

Mrs Krieske: But this is a bag of shit. Quizmaster: Ah, but it’s GREAT shit, Mrs. Krieske.

-Firesign Theatre

snoofle · 2007-04-10 Reply Admin

vt_mruhlin:
snoofle:
vt_mruhlin:
newbie:
I think the whole site should be revamped to say *what is wrong with the code and what is the right way to do it* .... Or maybe all the wiseguys can argue about the best way to do it.

That seems to be the primary purpose of the comments section anyhow. It generally turns into a collaborative discussion of how we could make that crap better.

In some ways I'm glad to see others are as baffled as I am on this one. On the other hand, I was kind of hoping somebody would figure it out and I'd learn something new.
So 'crap' is available in different levels of quality? This will inevitably lead to conversations such as...

I am not satisfied with this crap; I'd like better crap; top quality crap!

If you worked for my company, that wouldn't be a joke...

Sadly, it is the same here.

2007-04-10 Reply Admin

snoofie: Maybe you already know what you're doing with the database stuff, but the textbook "Information Retrieval: Algorithms and Heuristics" chapter 6.4 gives a schema and procedure for storing XML in a relational database (not sure which normal form, but it isn't just "shove it all in a CLOB") in order to efficiently search for data. Apparently you can recreate the exact XML file using only what's in the database. Doesn't exactly help with provisioning space for the database, but it does give you a way to deal with the 0..inf repetitions of various fields. HTH.

Amazon link.

snoofle · 2007-04-10 Reply Admin

vtbassmatt:
snoofie: Maybe you already know what you're doing with the database stuff, but the textbook "Information Retrieval: Algorithms and Heuristics" chapter 6.4 gives a schema and procedure for storing XML in a relational database (not sure which normal form, but it isn't just "shove it all in a CLOB") in order to efficiently search for data. Apparently you can recreate the exact XML file using only what's in the database. Doesn't exactly help with provisioning space for the database, but it does give you a way to deal with the 0..inf repetitions of various fields. HTH.
Amazon link.

Actually, I have no clue how to create a database (properly). We have a team of DBA's for that sort of stuff. I just got dragged into this to figure out how much space to allocate.

Interestingly, the rule here is no xml in the db (because querying via xpath is s-l-o-w; which kind of makes sense to me). Since this particular project has no need for the xml once the data is stored, the plan is to disect the xml into it's component fields (once) and shove it into something that's reasonably normalized/optimized for high speed queries.

Thanks for the link!

zip · 2007-04-10 Reply Admin

http://msdn2.microsoft.com/en-us/library/aa259221(SQL.80).aspx

Syntax

SET IDENTITY_INSERT [ database. [ owner. ] ] { table } { ON | OFF }

Remarks

At any time, only one table in a session can have the IDENTITY_INSERT property set to ON. If a table already has this property set to ON, and a SET IDENTITY_INSERT ON statement is issued for another table, Microsoft® SQL Server™ returns an error message that states SET IDENTITY_INSERT is already ON and reports the table it is set ON for.

If the value inserted is larger than the current identity value for the table, SQL Server automatically uses the new inserted value as the current identity value.

The setting of SET IDENTITY_INSERT is set at execute or run time and not at parse time.

I'm guessing the section in bold is why they suggest you cross your fingers. And anytime you're manually inserting into an autonumber field (aside from maybe some migration/setup script)... that's probably a sign that stuff is fubar'ed.

2007-04-10 Reply Admin

snoofle:
ALTER PROCEDURE [dbo].[ZZZZZZZZJunk] AS
  SET IDENTITY_INSERT ARTrans1  ON
  select 'Cross your fingers...'
  SET IDENTITY_INSERT ARTrans1  OFF
.
Ok, just guessing here...

'select ...' was probably intended to be 'insert...', which would make sense given the statement that the db has no uniqueness constraints (at least for username). Otherwise, it is truly a wtf as insert-flags have no effect on select's.

Then again, why would you have identity checks enabled when no unique constraints exist? I mean, why not make the whole thing a bunch of varchar(255)'s?

Sort-of related aside: I've just been given the task of figuring how much storage to allocate for a db. The data will be coming to us in an xml doc. The schema for the xml has every single field listed as unlimited length strings in repeatable groups of 0..infinity repetitions. I told the guy he'll need somewhere between 0..infinity GB of storage.

Admittedly, I'm not a DB guy, but come-on!!!

sigh

Hmmmm, no I think the SELECT statement is basically a kind of Debug.WriteLine statement. I've seen many stored procedures that use SELECT statements for "writing something to the console".

Execute it in Query Analyzer and you have a nice sequence of debug output. I know there is a print statement. Don't know why people don't use it.

2007-04-10 Reply Admin

you mention Hurricane Katrina??? RACIST!!! RACIST!!! RACIST!!! RACIST!!! RACIST!!! RACIST!!!

2007-04-10 Reply Admin

hung:
you mention Hurricane Katrina??? RACIST!!! RACIST!!! RACIST!!! RACIST!!! RACIST!!! RACIST!!!

Yeah, seriously! You might as well call Hurricane Katrina a GOOK!

2007-04-10 Reply Admin

That code doesn't do anything except return 'Cross your fingers', I'm guessing you meant to have an insert statement?

You guys are reaching a bit for WTFs these days

2007-04-10 Reply Admin

[quote user="igitur]Hmmmm, no I think the SELECT statement is basically a kind of Debug.WriteLine statement. I've seen many stored procedures that use SELECT statements for "writing something to the console". [/quote]

You could also use it to insert into a table using the insert exec syntax.

Eg: insert MyTable exec DumbStoredProc

It'd try and insert the piped return of the proc into the table you specified.

2007-04-10 Reply Admin

249 Users with over 400 Johnses and Joseses ?

Nulls and blanks aren't the only problems!

evanm · 2007-04-10 Reply Admin

Alex:
... resulting in a now-unchangeable 249 users with a username of "null", 103 with the username "" (that's blank), and over 400 Jose's and John's.

Steven:
249 Users with over 400 Johnses and Joseses ?
Nulls and blanks aren't the only problems!

Nope. Apparently the totality of the the problem rests on your inability to read and properly interpret a list of values :P

qbolec · 2007-04-10 Reply Admin

Steven:
249 Users with over 400 Johnses and Joseses ?
Nulls and blanks aren't the only problems!

Nice catch. Maybe they stored them in the single cell, hashseparated.

2007-04-10 Reply Admin

Isn't it obvious?

The application probably uses a top secret code to encrypt much of its data. For better security, a new top secret code is generated each day, by the Top Secret Code Of The Day Server (TSCOTDS).

A process (the Top Secret Code Of The Day Update Process) on the database server retrieves the TSCOTD from the TSCOTDS and writes it into the database (inserting it as a new row to maintain a history of TSCOTDs). The code is stored in the single column of the ARTrans1 table. This process and table were created by Arthur Reed, a programmer who is no longer with the company. (There were actually two "Arthur Reed"s at the company at that time; this process was created by the transexual one, who has since become Amy Reed. She left the company after her operation.) This system was working fine until recently.

On March 12 this year, the DBA came in to work to find the database server's disk full. The ARTrans1 table had grown enormously. It seems the TSCOTDUP was in a tight loop, inserting the TSCOTD over and over, rather than once a day as it should. The programmers hypothesized that the bug might have something to do with the daylight savings time switch on the 11th, but were unable to find the source code for the TSCOTDUP, and therefor could not debug the problem. Ms Reed, perhaps wanting to make a clean break with her male past, had left no contact information, and could not be found. So the programmers did what anyone would do. They made the column in ARTrans1 an identity so the inserts would fail. Once daily, they momentarily set IDENTITY_INSERT ON, and cross their fingers that exactly one insert succeeds during the IDENTITY_INSERT ON window.

qbolec · 2007-04-10 Reply Admin

evanm:
Alex:
... resulting in a now-unchangeable 249 users with a username of "null", 103 with the username "" (that's blank), and over 400 Jose's and John's.

Steven:
249 Users with over 400 Johnses and Joseses ?
Nulls and blanks aren't the only problems!

Nope. Apparently the totality of the the problem rests on your inability to read and properly interpret a list of values :P

Nice catch, too:) But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

snoofle · 2007-04-10 Reply Admin

Tom Future:
Isn't it obvious?
The application probably uses a top secret code to encrypt much of its data. For better security, a new top secret code is generated each day, by the Top Secret Code Of The Day Server (TSCOTDS).

A process (the Top Secret Code Of The Day Update Process) on the database server retrieves the TSCOTD from the TSCOTDS and writes it into the database (inserting it as a new row to maintain a history of TSCOTDs). The code is stored in the single column of the ARTrans1 table. This process and table were created by Arthur Reed, a programmer who is no longer with the company. (There were actually two "Arthur Reed"s at the company at that time; this process was created by the transexual one, who has since become Amy Reed. She left the company after her operation.) This system was working fine until recently.

On March 12 this year, the DBA came in to work to find the database server's disk full. The ARTrans1 table had grown enormously. It seems the TSCOTDUP was in a tight loop, inserting the TSCOTD over and over, rather than once a day as it should. The programmers hypothesized that the bug might have something to do with the daylight savings time switch on the 11th, but were unable to find the source code for the TSCOTDUP, and therefor could not debug the problem. Ms Reed, perhaps wanting to make a clean break with her male past, had left no contact information, and could not be found. So the programmers did what anyone would do. They made the column in ARTrans1 an identity so the inserts would fail. Once daily, they momentarily set IDENTITY_INSERT ON, and cross their fingers that exactly one insert succeeds during the IDENTITY_INSERT ON window.

... Meanwhile, back at the ranch, the erstwhile Paula Bean wrote a customized bean to handle this special case. It would intentionally create a file named for the top-secret-code-of-the-day (i.e.: TSCOTD12345). Then, when the TSCOTDUP would attempt to insert the new code, it would loop (trying to open a file named for the proposed top-secret-code-of-the-day) until it received a FILE-NOT-FOUND error, thus indicating that the new code was unique.... (ok, someone else pick up the story...)

2007-04-10 Reply Admin

snoofle:
Tom Future:
Isn't it obvious?
The application probably uses a top secret code to encrypt much of its data. For better security, a new top secret code is generated each day, by the Top Secret Code Of The Day Server (TSCOTDS).

A process (the Top Secret Code Of The Day Update Process) on the database server retrieves the TSCOTD from the TSCOTDS and writes it into the database (inserting it as a new row to maintain a history of TSCOTDs). The code is stored in the single column of the ARTrans1 table. This process and table were created by Arthur Reed, a programmer who is no longer with the company. (There were actually two "Arthur Reed"s at the company at that time; this process was created by the transexual one, who has since become Amy Reed. She left the company after her operation.) This system was working fine until recently.

On March 12 this year, the DBA came in to work to find the database server's disk full. The ARTrans1 table had grown enormously. It seems the TSCOTDUP was in a tight loop, inserting the TSCOTD over and over, rather than once a day as it should. The programmers hypothesized that the bug might have something to do with the daylight savings time switch on the 11th, but were unable to find the source code for the TSCOTDUP, and therefor could not debug the problem. Ms Reed, perhaps wanting to make a clean break with her male past, had left no contact information, and could not be found. So the programmers did what anyone would do. They made the column in ARTrans1 an identity so the inserts would fail. Once daily, they momentarily set IDENTITY_INSERT ON, and cross their fingers that exactly one insert succeeds during the IDENTITY_INSERT ON window.
... Meanwhile, back at the ranch, the erstwhile Paula Bean wrote a customized bean to handle this special case. It would intentionally create a file named for the top-secret-code-of-the-day (i.e.: TSCOTD12345). Then, when the TSCOTDUP would attempt to insert the new code, it would loop (trying to open a file named for the proposed top-secret-code-of-the-day) until it received a FILE-NOT-FOUND error, thus indicating that the new code was unique.... (ok, someone else pick up the story...)

Upon realizing her success, Paula went out to celebrate. At the bar, she met Amy. Amy quickly realized that they were kindred spirits, and decided that she would rather be Arthur again. Thus, they decided to go at it. Unfortunately, when attempting to couple, there was an InitiationFailureException... Next!

2007-04-10 Reply Admin

evanm:
Alex:
... resulting in a now-unchangeable 249 users with a username of "null", 103 with the username "" (that's blank), and over 400 Jose's and John's.

Steven:
249 Users with over 400 Johnses and Joseses ?
Nulls and blanks aren't the only problems!

Nope. Apparently the totality of the the problem rests on your inability to read and properly interpret a list of values :P

At least Steven didn't repeat that depressingly common grammatical transgression of the inappropriate apostrophe.

2007-04-10 Reply Admin

qbolec:
But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

Why is it odd? Visit the southwestern United States, for example, and you'll find both Johns and Joses in large numbers.

2007-04-10 Reply Admin

qbolec:
evanm:
Alex:
... resulting in a now-unchangeable 249 users with a username of "null", 103 with the username "" (that's blank), and over 400 Jose's and John's.

Steven:
249 Users with over 400 Johnses and Joseses ?
Nulls and blanks aren't the only problems!

Nope. Apparently the totality of the the problem rests on your inability to read and properly interpret a list of values :P
Nice catch, too:) But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

Well, there's this bird sitting in a tree, when along comes a bee with a stinger...

KattMan · 2007-04-10 Reply Admin

Will:
qbolec:
But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

Why is it odd? Visit the southwestern United States, for example, and you'll find both Johns and Joses in large numbers.

Not to mention we do not have the total number of people listed. It would be odd if there were 400 Johns in a 700 record database, but to have 400 in a seven million record database would be nothing.

2007-04-10 Reply Admin

Continued:
snoofle:
Tom Future:
Isn't it obvious?
The application probably uses a top secret code to encrypt much of its data. For better security, a new top secret code is generated each day, by the Top Secret Code Of The Day Server (TSCOTDS).

A process (the Top Secret Code Of The Day Update Process) on the database server retrieves the TSCOTD from the TSCOTDS and writes it into the database (inserting it as a new row to maintain a history of TSCOTDs). The code is stored in the single column of the ARTrans1 table. This process and table were created by Arthur Reed, a programmer who is no longer with the company. (There were actually two "Arthur Reed"s at the company at that time; this process was created by the transexual one, who has since become Amy Reed. She left the company after her operation.) This system was working fine until recently.

On March 12 this year, the DBA came in to work to find the database server's disk full. The ARTrans1 table had grown enormously. It seems the TSCOTDUP was in a tight loop, inserting the TSCOTD over and over, rather than once a day as it should. The programmers hypothesized that the bug might have something to do with the daylight savings time switch on the 11th, but were unable to find the source code for the TSCOTDUP, and therefor could not debug the problem. Ms Reed, perhaps wanting to make a clean break with her male past, had left no contact information, and could not be found. So the programmers did what anyone would do. They made the column in ARTrans1 an identity so the inserts would fail. Once daily, they momentarily set IDENTITY_INSERT ON, and cross their fingers that exactly one insert succeeds during the IDENTITY_INSERT ON window.
... Meanwhile, back at the ranch, the erstwhile Paula Bean wrote a customized bean to handle this special case. It would intentionally create a file named for the top-secret-code-of-the-day (i.e.: TSCOTD12345). Then, when the TSCOTDUP would attempt to insert the new code, it would loop (trying to open a file named for the proposed top-secret-code-of-the-day) until it received a FILE-NOT-FOUND error, thus indicating that the new code was unique.... (ok, someone else pick up the story...)
Upon realizing her success, Paula went out to celebrate. At the bar, she met Amy. Amy quickly realized that they were kindred spirits, and decided that she would rather be Arthur again. Thus, they decided to go at it. Unfortunately, when attempting to couple, there was an InitiationFailureException... Next!

After some discussion, both Paula and Amy realized that nothing would work in the bathroom of the bar, so they sought out a wooden table. Naturally, everyone with a cell phone was snapping pictures as fast as possible. Next!

2007-04-10 Reply Admin

Continued:
snoofle:
Tom Future:
Isn't it obvious?
The application probably uses a top secret code to encrypt much of its data. For better security, a new top secret code is generated each day, by the Top Secret Code Of The Day Server (TSCOTDS).

A process (the Top Secret Code Of The Day Update Process) on the database server retrieves the TSCOTD from the TSCOTDS and writes it into the database (inserting it as a new row to maintain a history of TSCOTDs). The code is stored in the single column of the ARTrans1 table. This process and table were created by Arthur Reed, a programmer who is no longer with the company. (There were actually two "Arthur Reed"s at the company at that time; this process was created by the transexual one, who has since become Amy Reed. She left the company after her operation.) This system was working fine until recently.

On March 12 this year, the DBA came in to work to find the database server's disk full. The ARTrans1 table had grown enormously. It seems the TSCOTDUP was in a tight loop, inserting the TSCOTD over and over, rather than once a day as it should. The programmers hypothesized that the bug might have something to do with the daylight savings time switch on the 11th, but were unable to find the source code for the TSCOTDUP, and therefor could not debug the problem. Ms Reed, perhaps wanting to make a clean break with her male past, had left no contact information, and could not be found. So the programmers did what anyone would do. They made the column in ARTrans1 an identity so the inserts would fail. Once daily, they momentarily set IDENTITY_INSERT ON, and cross their fingers that exactly one insert succeeds during the IDENTITY_INSERT ON window.
... Meanwhile, back at the ranch, the erstwhile Paula Bean wrote a customized bean to handle this special case. It would intentionally create a file named for the top-secret-code-of-the-day (i.e.: TSCOTD12345). Then, when the TSCOTDUP would attempt to insert the new code, it would loop (trying to open a file named for the proposed top-secret-code-of-the-day) until it received a FILE-NOT-FOUND error, thus indicating that the new code was unique.... (ok, someone else pick up the story...)
Upon realizing her success, Paula went out to celebrate. At the bar, she met Amy. Amy quickly realized that they were kindred spirits, and decided that she would rather be Arthur again. Thus, they decided to go at it. Unfortunately, when attempting to couple, there was an InitiationFailureException... Next!

thankfully the senior dba finally came up with his wonderful solution. Every day, a low ranking techie was tasked with finding out the new filename that was created...(this of course was done by keeping an excel file on the shared drive with the list of all the current files in the directory) the techie would then search through the directory looking for the new file name that was not already in the excel spreadsheet, which he then added to the spreadsheet. after that the techie was to open the new file in notepad and print it...this would create a paper with the name of the file at the top. The techie then handed the paper over to the network administrators who would take a picture of the paper on a wooden desk of course and scan that in to the computer...(after taking it to the local walmart to be developed) they would then send the scanned image of the printout of the lastest file name through the company mail on a floppy disk to the dba's. ok i'm tired...

2007-04-10 Reply Admin

KattMan:
Will:
qbolec:
But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

Why is it odd? Visit the southwestern United States, for example, and you'll find both Johns and Joses in large numbers.

Not to mention we do not have the total number of people listed. It would be odd if there were 400 Johns in a 700 record database, but to have 400 in a seven million record database would be nothing.

Also worth noting that "over 400 Joses and Johns" is ambiguous; one could say that of a group of 300 Johns and 105 Joses.

cparker · 2007-04-10 Reply Admin

snoofle:
Sort-of related aside: I've just been given the task of figuring how much storage to allocate for a db. The data will be coming to us in an xml doc. The schema for the xml has every single field listed as unlimited length strings in repeatable groups of 0..infinity repetitions. I told the guy he'll need somewhere between 0..infinity GB of storage.
Admittedly, I'm not a DB guy, but come-on!!!

That just made my day. It probably also made my colleagues think I'm crazy, as it made me laugh quite noticeably, but it was worth that risk.

2007-04-10 Reply Admin

snoofle:
vtbassmatt:
snoofie: Maybe you already know what you're doing with the database stuff, but the textbook "Information Retrieval: Algorithms and Heuristics" chapter 6.4 gives a schema and procedure for storing XML in a relational database (not sure which normal form, but it isn't just "shove it all in a CLOB") in order to efficiently search for data. Apparently you can recreate the exact XML file using only what's in the database. Doesn't exactly help with provisioning space for the database, but it does give you a way to deal with the 0..inf repetitions of various fields. HTH.
Amazon link.
Actually, I have no clue how to create a database (properly). We have a team of DBA's for that sort of stuff. I just got dragged into this to figure out how much space to allocate.

Interestingly, the rule here is no xml in the db (because querying via xpath is s-l-o-w; which kind of makes sense to me). Since this particular project has no need for the xml once the data is stored, the plan is to disect the xml into it's component fields (once) and shove it into something that's reasonably normalized/optimized for high speed queries.

Thanks for the link!

I'd start with 10x the size of the XML file your importing if its a single import situation... 1000x the xml file if its a daily import. That should get things started and if you have space left over bully for you. But frankly you can tell nothing about the space requirements from a schema... especially a schema with unlimited length string fields.

snoofle · 2007-04-10 Reply Admin

JUST ANOTHER WTF:
snoofle:
vtbassmatt:
snoofie: Maybe you already know what you're doing with the database stuff, but the textbook "Information Retrieval: Algorithms and Heuristics" chapter 6.4 gives a schema and procedure for storing XML in a relational database (not sure which normal form, but it isn't just "shove it all in a CLOB") in order to efficiently search for data. Apparently you can recreate the exact XML file using only what's in the database. Doesn't exactly help with provisioning space for the database, but it does give you a way to deal with the 0..inf repetitions of various fields. HTH.
Amazon link.
Actually, I have no clue how to create a database (properly). We have a team of DBA's for that sort of stuff. I just got dragged into this to figure out how much space to allocate.

Interestingly, the rule here is no xml in the db (because querying via xpath is s-l-o-w; which kind of makes sense to me). Since this particular project has no need for the xml once the data is stored, the plan is to disect the xml into it's component fields (once) and shove it into something that's reasonably normalized/optimized for high speed queries.

Thanks for the link!

I'd start with 10x the size of the XML file your importing if its a single import situation... 1000x the xml file if its a daily import. That should get things started and if you have space left over bully for you. But frankly you can tell nothing about the space requirements from a schema... especially a schema with unlimited length string fields.

Agreed. The gory details are that I requested average string lengths and repetition counts for each generic field for each substructure for each type of xml doc. From that, I could figure out the amount of storage for an average xml record, multiplied by the average number of records per xml doc multiplied by the delivery frequency over the retention period. To that I'd add 50% for assorted indices and system overhead, and another 33% for temp space for running reports and whatnot.

Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... groans

KattMan · 2007-04-10 Reply Admin

snoofle:
Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... *groans*

The wonders of IT. Your company should either have them provide it, or charge them a research fee to compile this information for them. When it comes down to money, they either decide it is worth it to have you provide the info or they will find a way. Either way, you won't have to start until you have the information and you might actually find a new revenue option for existing customers.

2007-04-10 Reply Admin

Will:
KattMan:
Will:
qbolec:
But still, having such a large number of Jose's is kinda odd. I'm not a racist, but..where do they come from?

Why is it odd? Visit the southwestern United States, for example, and you'll find both Johns and Joses in large numbers.

Not to mention we do not have the total number of people listed. It would be odd if there were 400 Johns in a 700 record database, but to have 400 in a seven million record database would be nothing.

Also worth noting that "over 400 Joses and Johns" is ambiguous; one could say that of a group of 300 Johns and 105 Joses.

Being that there is no unique constraint on the username column them amy be only 1 John and 1 Jose in the company and that multiple rows were added for each for some reason. Obviously something is adding multiple rows with the same username value or you wouldn't be seeing all those nulls and blanks.

snoofle · 2007-04-10 Reply Admin

KattMan:
snoofle:
Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... *groans*

The wonders of IT. Your company should either have them provide it, or charge them a research fee to compile this information for them. When it comes down to money, they either decide it is worth it to have you provide the info or they will find a way. Either way, you won't have to start until you have the information and you might actually find a new revenue option for existing customers.

You, Sir, offer a common-sense approach, and are assuming our managers have brains in their heads rather than up points-south. I was requested (spelled o-r-d-e-r-e-d) to grind through the data to "just get it done".

seeking employment elsewhere

KattMan · 2007-04-10 Reply Admin

snoofle:
KattMan:
snoofle:
Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... *groans*

The wonders of IT. Your company should either have them provide it, or charge them a research fee to compile this information for them. When it comes down to money, they either decide it is worth it to have you provide the info or they will find a way. Either way, you won't have to start until you have the information and you might actually find a new revenue option for existing customers.
You, Sir, offer a common-sense approach, and are assuming our managers have brains in their heads rather than up points-south. I was requested (spelled o-r-d-e-r-e-d) to grind through the data to "just get it done".

seeking employment elsewhere

Oh I know, but I also know how to talk to management. Mention new revenue and they at least start to listen. Of course all this could be rolled into a new "set-up" fee. Of course if they are totally deaf to these suggestions it would make me wonder about the rest of the process. If it is totally WTF worthy, just send in some examples before finding a new job.

2007-04-10 Reply Admin

I must know if there really are a bunch of users with the username "null" or are there a bunch of null usernames?

2007-04-10 Reply Admin

KattMan:
snoofle:
Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... *groans*

The wonders of IT. Your company should either have them provide it, or charge them a research fee to compile this information for them. When it comes down to money, they either decide it is worth it to have you provide the info or they will find a way.

When pressed in this manner, any IT professional worth their salt will just make up this information and get back to reading Dilbert and remarking on the hilarious parallels with their own situation. Thank heavens my employers don't read this...

2007-04-10 Reply Admin

snoofle:
JUST ANOTHER WTF:
snoofle:
vtbassmatt:
snoofie: Maybe you already know what you're doing with the database stuff, but the textbook "Information Retrieval: Algorithms and Heuristics" chapter 6.4 gives a schema and procedure for storing XML in a relational database (not sure which normal form, but it isn't just "shove it all in a CLOB") in order to efficiently search for data. Apparently you can recreate the exact XML file using only what's in the database. Doesn't exactly help with provisioning space for the database, but it does give you a way to deal with the 0..inf repetitions of various fields. HTH.
Amazon link.
Actually, I have no clue how to create a database (properly). We have a team of DBA's for that sort of stuff. I just got dragged into this to figure out how much space to allocate.

Interestingly, the rule here is no xml in the db (because querying via xpath is s-l-o-w; which kind of makes sense to me). Since this particular project has no need for the xml once the data is stored, the plan is to disect the xml into it's component fields (once) and shove it into something that's reasonably normalized/optimized for high speed queries.

Thanks for the link!

I'd start with 10x the size of the XML file your importing if its a single import situation... 1000x the xml file if its a daily import. That should get things started and if you have space left over bully for you. But frankly you can tell nothing about the space requirements from a schema... especially a schema with unlimited length string fields.
Agreed. The gory details are that I requested average string lengths and repetition counts for each generic field for each substructure for each type of xml doc. From that, I could figure out the amount of storage for an average xml record, multiplied by the average number of records per xml doc multiplied by the delivery frequency over the retention period. To that I'd add 50% for assorted indices and system overhead, and another 33% for temp space for running reports and whatnot.

Of course, the vendor doesn't have that information (or is too lazy to dig it up), leaving me with the wonderful task of getting data files for the past six months and writing some scratch programs to crunch through it all and do the math like that. I should try to sell the stats back to the vendor... groans

better yet just take that directory where you gathered all the XML files find the size and x10.

no analysis needed...

2007-04-11 Reply Admin

The original developers never thought to put a unique constraint on the "Username" column, resulting in a now-unchangeable 249 users with a username of "null", 103 with the username "" (that's blank)

Look on the bright side - at least they're not using Oracle.

(I hate Oracle's null string problem.)

Cross Your Fingers ...

Leave a comment on “Cross Your Fingers ...”