• (cs) in reply to TOPCTEH
    TOPCTEH:
    JamesCurran:

    However, one thing which I haven't seen anyone comment on, is the fact that

    • 1 Forrest St.
    • 1 Forrest Str.
    • 1 Forrest Street
    • One Forrest St.
    • One Forrest Str.
    • One Forrest Street
    • 1 Forrest Ct
    • 1 Forrest Court.
    • 1 Forrest Ave.

    And a whole bunch of other variations are (as far as the USPS is concerned) all the same address.

    Gnaa, that reminds me of my first ever IT project: the change of Germany's zip code from four to five digits in 1993.

    In that project we discovered DDFH (database design from hell): two database tables were connected to each other by ... the address! WTF? But it got worse: these two tables were updated by two different applications, so the data diverged and two formerly connected records were not connected anymore. The whole project team desired to meet the developer of that crap in a quiet corner of a public park at 2:00am.

    And BTW: Can you imagine how inventive users can be in creating address abbreviations?

    Torsten

     I actually maintain a huge USPS database for mailing customers, and the way we solved the abbreviation problem was forcing the users to use specific prefixes/suffixes, and not allowing regular users to add new streets.

    We still have 10-20 errors every update period though. =/ (4 times a year we synch our db with the USPS...Ironically about half those errors are on their side, and I've seen 'em every update for years. Little Colllier Rd? I think not.)
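    The "specific prefixes/suffixes" approach boils down to canonicalizing user input before it is ever used for matching. A minimal sketch of that idea -- the abbreviation and number-word tables here are tiny illustrative stand-ins, not the real USPS list, and note that genuinely different suffixes like Ct and Ave stay distinct:

```python
# Illustrative only: the real USPS suffix table is far larger.
SUFFIX_MAP = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AVE", "AVE": "AVE",
    "COURT": "CT", "CT": "CT",
}
NUMBER_WORDS = {"ONE": "1", "TWO": "2", "THREE": "3"}

def normalize(address: str) -> str:
    # Strip periods, uppercase, and split into words.
    words = address.replace(".", "").upper().split()
    # Spelled-out house numbers become digits.
    words = [NUMBER_WORDS.get(w, w) for w in words]
    # Canonicalize a recognized trailing suffix.
    if words and words[-1] in SUFFIX_MAP:
        words[-1] = SUFFIX_MAP[words[-1]]
    return " ".join(words)
```

    With this, "1 Forrest St.", "One Forrest Str." and "1 Forrest Street" all collapse to the same string, while "1 Forrest Ct" stays a different address.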

  • (cs) in reply to RoKe

    RoKe:
    If your mailpiece has 123 Main ST (my canonical test address) on it, and there are both a 123 E MAIN ST and a 123 MAIN CT, as I recall it will be delivered to 123 MAIN CT because changing the suffix is considered a smaller weight change than adding a directional.


    This to me seems backwards.  Isn't it more likely that the average person would forget the directional rather than use the wrong suffix?  But what do I know, I can't comprehend the collective mind of the USPS.

    Whichever bureau/department/what have you is in charge of street naming really should try harder to avoid this kind of duplication.  I recently went looking for the town hall in the town I just moved into.  The address was something like 225 Main St., or so I thought.  So I drove back and forth on Main St. a couple of times, from the end, and only saw 100-level numbers and lower, which were decreasing.  Turns out the actual address was 225 E. Main St., which begins where Main St. "ends" (logically only, it's the same physical street), with the numbers increasing.  Now, E and W I can sort of understand, if there's an obvious boundary like a major road or train tracks or whatever.  But E and nothing, where it switches at some random intersection?  WTF?

  • lance (unregistered) in reply to RevMike

    RevMike:
    johnl:
    I'm no DBA but I can't think of any situations where an artificial key would be really bad.  I can think of situations where a natural key would be bad, and situations where either one would work, though.


    For performance critical systems, a good natural key can be significantly better.  The propagation of the key can mean that joins can be eliminated.  The rule of thumb should always be, however, "When in doubt use an artificial key."

    I agree with what you're saying for the most part, but again I'd point out that the same thing can be accomplished using an artificial key.  What eliminates the joins is that you've made a second copy of some of the data from the parent table in each row of the child tables; effectively, you've denormalized the schema somewhat.  Performing that denormalization is a separable issue from how you define your primary key.  Nothing prevents you from denormalizing for performance just because you use an artificial key; it just happens to be a side effect of using a natural key.
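    To make the denormalization point concrete, here is a sketch in SQLite: both tables use artificial integer keys, and the join is eliminated simply because a copy of a parent column lives on the child. All table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Parent and child both use artificial integer keys; the child also
# carries a denormalized copy of the parent's region column.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    customer_region TEXT,      -- denormalized copy of customer.region
    amount REAL)""")

cur.execute("INSERT INTO customer VALUES (1, 'Acme', 'EAST')")
cur.execute("INSERT INTO orders VALUES (1, 1, 'EAST', 100.0)")
cur.execute("INSERT INTO orders VALUES (2, 1, 'EAST', 50.0)")

# Aggregate by region straight off the child table -- no join needed,
# even though every key involved is artificial.
total = cur.execute(
    "SELECT SUM(amount) FROM orders WHERE customer_region = 'EAST'"
).fetchone()[0]
```

    The join-elimination comes entirely from the copied column, not from the choice of key.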

  • (cs) in reply to cconroy

    @cconroy:  We actually have a similar thing in the UK.  There's a Chadwick Street and a Chadwick Street South, though from the road layout in that case it's obvious they're different.

    Personally, though, in the example you gave, I reckon that Main Street, East Main Street and West Main Street should all be the same street.  The directional should be there only to give you a clue which part of the street you're looking for, since in the US streets can get very long.  So if they say "123 E Main St" then you know that it's going to be on Main Street, somewhere towards the East end of the street.

  • Anonymous Coward (unregistered) in reply to sao
    sao:
    mlathe:
    TankerJoe:
    mlathe:
    TankerJoe:
    mlathe:
    aikimark:

    mlathe:
    The worst thing is that he added the uniqueness to the front of address. I would have added whitespace to the end of the address

    While that would be the sane thing to do, I'd almost bet that this was an MSAccess database.  It automatically trims trailing space characters from text fields.

    What I can't figure out is why this database wasn't under scrutiny as soon as the first mailing failed.  Wasn't anyone paying attention?

    meant to be a joke



    Don't worry.  I got it.

    trying to be witty



    thats what you get

    next time i'll write

    The worst thing is that he added the uniqueness to the front of address. I would have added whitespace to the end of the address <insert laughter=""></insert>



    i just wanted it quoted ONCE more.


    Ooh. That's fun.
  • (cs) in reply to lance
    Anonymous:

    RevMike:
    johnl:
    I'm no DBA but I can't think of any situations where an artificial key would be really bad.  I can think of situations where a natural key would be bad, and situations where either one would work, though.


    For performance critical systems, a good natural key can be significantly better.  The propagation of the key can mean that joins can be eliminated.  The rule of thumb should always be, however, "When in doubt use an artificial key."

    I agree with what you're saying for the most part, but again I'd point out that the same thing can be accomplished using an artificial key.  What eliminates the joins is that you've made a second copy of some of the data from the parent table in each row of the child tables; effectively, you've denormalized the schema somewhat.  Performing that denormalization is a separable issue from how you define your primary key.  Nothing prevents you from denormalizing for performance just because you use an artificial key; it just happens to be a side effect of using a natural key.



    I think the entire argument boils down to this: there are very few panaceas in database design.  There are only models, and it's up to your planning (and some good generalities about the laws of physics) to implement the best model that fits the application at hand with the database engine you bought (or downloaded).  The trick is in properly justifying your design decisions and testing your theories BEFORE you apply them (and are stuck with them).

    It's very hard to fix a crufted database after it goes into production and starts to see heavy production use.  I've done DB transitions, and those projects are long, hard and complicated.
  • (cs) in reply to Anonymous Coward
    Anonymous:
    sao:
    mlathe:
    TankerJoe:
    mlathe:
    TankerJoe:
    mlathe:
    aikimark:

    mlathe:
    The worst thing is that he added the uniqueness to the front of address. I would have added whitespace to the end of the address

    While that would be the sane thing to do, I'd almost bet that this was an MSAccess database.  It automatically trims trailing space characters from text fields.

    What I can't figure out is why this database wasn't under scrutiny as soon as the first mailing failed.  Wasn't anyone paying attention?

    meant to be a joke



    Don't worry.  I got it.

    trying to be witty



    thats what you get

    next time i'll write

    The worst thing is that he added the uniqueness to the front of address. I would have added whitespace to the end of the address <insert laughter=""></insert>



    i just wanted it quoted ONCE more.


    Ooh. That's fun.


    Are we looking at a new WTFism?
  • (cs) in reply to Arachnid

    Anonymous:
    I wish I knew the name of that developer because I think he found his way to the project I'm currently working on. He's our 'project manager'. I use the term reluctantly. One of his nicknames is "Donkie".  One of his MANY claims to shame is that he stores dates as integers (int). Just try and get records from one date to another with that kind of design. Best part is that the project he's mangling is a financial application that manages more than $35 million in assets.

    Try to get records from one date to another.... okay. I'll give it a shot. How about all the records in October of this year?

    SELECT * FROM appropriate_table WHERE date >= 20051001 AND date < 20061101 ORDER BY date;

    Do I win?

  • Jeff (unregistered) in reply to christoofar

    Phone number and name is a good PK?

    I guess you're not familiar with John Smith and his son John Smith, who both live at the same address with one phone number?  <sarcasm>No, nobody would ever name their kid after themself.</sarcasm>

  • jbode (unregistered) in reply to brazzy

    brazzy:
    KoFFiE:
    The real power of using dates as integer numbers is calculating and comparing, if you know basic math operators like +, -, modulo, and the facts that one minute is 60sec, one hour is 3600secs, and one day 86400secs. Not that hard I think?


    You think wrong. Many places use daylight savings time, which means that each year, they have one day that is 23 hours long and one that is 25 hours long.

    Not to mention leap years and the odd leap second.  For anyone who's concerned about keeping their clock sync'd with the Earth's rotation, timekeeping isn't quite so cut-and-dried.  Granted, for almost all business cases, leap seconds can safely be ignored. About the only people who really care are astronomers.   

  • (cs) in reply to dpm
    dpm:
    JamesCurran:

    However, one thing which I haven't seen anyone comment on, is the fact that

    • 1 Forrest St.
    • 1 Forrest Str.
    • 1 Forrest Street
    • One Forrest St.
    • One Forrest Str.
    • One Forrest Street
    • 1 Forrest Ct
    • 1 Forrest Court.
    • 1 Forrest Ave.

    And a whole bunch of other variations are (as far as the USPS is concerned) all the same address.



    The first six are.  Why you think "court" and "avenue" are evaluated to be "street" is beyond me, though.
    I worked on USPS software for two years and can assure you they aren't.  You can also look at
            http://www.usps.com/ncsc/lookups/usps_abbreviations.html

    ok
    dpm

    And I've had mail delivered to me by the USPS for *way* more than two years and I can assure you that they only care about court, street, avenue, etc. when there's a reason to. So there!

    Honestly, these types of "corrections for corrections' sake" posts sound like kids squabbling. Arg.

    James' point is that using a user-typed string as a primary key is troublesome - at my last job they had a table that was keyed on "first name + last name" and in it there were plenty of duplicate entries based on "John O'hara" = "John O'Hara" = "J Ohara" = "Jhon O'Hara" etc. So much easier to have a lookup table with standardized names keyed by an id ...

    IMEO, of course.
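    The lookup-table point can be sketched in a few lines; the id 101 and the names are of course made up:

```python
# A user-typed string as a key: trivially different spellings of the
# same person become four "different" primary key values.
variants = ["John O'Hara", "John O'hara", "J Ohara", "Jhon O'Hara"]
distinct_keys = len(set(variants))   # 4 rows for one person

# With an opaque id as the key, correcting the display name later
# doesn't invalidate anything that references the row.
people = {101: "Jhon O'Hara"}    # id -> display name, typo and all
people[101] = "John O'Hara"      # fix the typo; the key 101 is unchanged
```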

  • Anonymous Developer (unregistered)

    I had a similar but more insidious bug not too long ago. The repro steps were just to open a file on a web server; the request redirected to the actual file on another server. Fair enough. It was failing approximately 4/5 times with a nonspecific download error. Turns out that the web server (an internal one we stock with a variety of sample documents) was doing some database-backed auditing with a script that used a time offset in seconds as (part of?) the primary key (I presume it was something like (originating address or user, page requested, time)). Two requests in the same second caused an internal server error on the second: ODBC error, duplicate primary key.

    Tracking it down was a lot of fun (not really). Single-stepping in the debugger always succeeded, because it would be more than a second between requests (there were multiple requests made because of an IE library beyond our control); similarly it sometimes worked because the two requests would cross a second boundary. I ended up putting a lot of debug output statements in the code to narrow it down, watching the network traffic with NetMon, and then testing my hypothesis with a perl (LWP++) script and then in the browser. Clearly this wasn't our fault, and was resolved that way.

  • drive by coder (unregistered) in reply to sammybaby
    sammybaby:

    Anonymous:
    I wish I knew the name of that developer because I think he found his way to the project I'm currently working on. He's our 'project manager'. I use the term reluctantly. One of his nicknames is "Donkie".  One of his MANY claims to shame is that he stores dates as integers (int). Just try and get records from one date to another with that kind of design. Best part is that the project he's mangling is a financial application that manages more than $35 million in assets.

    Try to get records from one date to another.... okay. I'll give it a shot. How about all the records in October of this year?

    SELECT * FROM appropriate_table WHERE date >= 20051001 AND date < 20061101 ORDER BY date;

    Do I win?

     

    ha ha ha ha.  No.  Try your dates again.

  • Rick DeBayds (unregistered) in reply to Stephen

    > i feel i need to definitely study more about databases.  is there a good book?

    Data Modeling Essentials, 2nd edition by Graeme Simsion.

  • Benjamin Smith (unregistered) in reply to Kevin

    If you don't mind me asking, what's wrong with storing dates as integers? A common format I use is YYYYMMDD - dates sort nicely, it's easy to constrain this format, and getting a date range is a simple greater/less than question.

    EG, for all of 2005: ...WHERE date >= 20050101 AND date <=20051231


    Or, if you use unix epoch, to get everything within a date range, just divide by 86400.

    The only negative is that you can't get +- NN days this way - but it's not hard to precalculate that before you submit the SQL statement.

    So why would this be a WTF again?

    PS: I manage finance-related data in the millions with my DB apps - and so far, these methods for keeping dates have always been spot on...
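    For what it's worth, the "+- NN days" precalculation mentioned above is straightforward if you round-trip through a real date type before building the SQL; `to_int` and `from_int` are hypothetical helper names:

```python
from datetime import date, timedelta

def to_int(d: date) -> int:
    """Encode a date as a YYYYMMDD integer."""
    return d.year * 10000 + d.month * 100 + d.day

def from_int(n: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return date(n // 10000, (n // 100) % 100, n % 100)

# "30 days before 2005-03-10" cannot be done with integer subtraction
# (20050310 - 30 == 20050280, which is not a date), so convert first:
cutoff = to_int(from_int(20050310) - timedelta(days=30))   # 2005-02-08
```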

  • (cs) in reply to kipthegreat
    kipthegreat:
    Sidenote- I noticed in Jacksonville, FL that there is a "Blvd. St." and I thought that was a little silly.


    Toronto, ON has an Avenue Road.

    Vancouver, BC has Kent Avenue North and Kent Avenue South on either side of the rail line.  East and west of the meridian have to be indicated so, there is
         E. Kent Ave. N.
         W. Kent Ave. N.
         E. Kent Ave. S.
         W. Kent Ave. S.
    This is right by the South Foot of Main Street that I already mentioned.

    Sincerely,

    Gene Wirchenko

  • (cs) in reply to drive by coder
    Anonymous:
    sammybaby:
    Try to get records from one date to another.... okay. I'll give it a shot. How about all the records in October of this year?

    SELECT * FROM appropriate_table WHERE date >= 20051001 AND date < 20061101 ORDER BY date;

    Do I win?

     

    ha ha ha ha.  No.  Try your dates again.

    Ba! Ha. Hee.

    Oops.

    [:$]

     

    (Originally, I was going to make the example December of this year, but then, for some asinine reason, I went, "No - December's not done yet. Do a month which is over." And so I changed the 20060101 to 20061101, instead of 20051001. Which is all by way of saying, I am a big dork.)

  • (cs) in reply to Benjamin Smith
    Anonymous:
    If you don't mind me asking, what's wrong with storing dates as integers? A common format I use is YYYYMMDD - dates sort nicely, it's easy to constrain this format, and getting a date range is a simple greater/less than question.

    EG, for all of 2005: ...WHERE date >= 20050101 AND date <=20051231

    Or, if you use unix epoch, to get everything within a date range, just divide by 86400.

    The only negative is that you can't get +- NN days this way - but it's not hard to precalculate that before you submit the SQL statement.

    So why would this be a WTF again?

    PS: I manage finance-related data in the millions with my DB apps - and so far, these methods for keeping dates have always been spot on...



    I've seen this quite often.  Finance-related data + this particular idiom screams Sybase at me.  I've been told that some older Sybase versions did a particularly bad job of indexing and searching on date columns.  This was a reasonable workaround for a flaw in the RDBMS.

    I wouldn't suggest designing a system like this now.

  • (cs) in reply to uncool
    Anonymous:

    I'm still just a hobbyist (so please forgive stupidity)

    but wouldn't a phone number be a (always) unique natural key?



    My wife and I have different last names but the same phone number.  If we (independently) purchased something from this business, they'd still need to call the dumba$$ developer to "uniquify" his idiotic key.
  • (cs) in reply to johnl

    johnl:
    Personally, though, in the example you gave, I reckon that Main Street, East Main Street and West Main Street should all be the same street.  The directional should be there only to give you a clue which part of the street you're looking for, since in the US streets can get very long.  So if they say "123 E Main St" then you know that it's going to be on Main Street, somewhere towards the East end of the street.



    I agree.  In my case, the entire street (well, the "Main" section of it -- it actually changes names a couple more times) is maybe two miles long.  Probably overkill to "divide" it.  But if you're going to call one half E, at least name the other section W!

    This is minor compared to the numbering, though.  In case I wasn't clear, the numbers start at 160 or so on one end, go down to zero, then increase again when it switches from E. Main to Main.  So it's not the case for the 225 address I was looking for, but it is entirely possible that there are two 123s, one on Main and one on E. Main.  This can get confusing really fast, especially if you're looking for a non-obvious building like I was (a house or small shop, say).
  • (cs) in reply to cconroy

    ...er, "not looking for an obvious building like I was".  Got my thoughts scrambled there.

  • (cs) in reply to RevMike

    RevMike:
    Generally speaking, the artificial versus natural key debate comes down to this:  You'll never do anything bad by using artificial keys.  There are some cases where you can do better using natural keys and lots of cases where you can do worse using natural keys.  If you're not 100% sure about using a natural key, go with artificial.

    Quoted for truthery.

    Sometimes using artificial keys makes programming slightly harder or slightly slower (another join to get displayable data), but I can't think of any situations where using an artificial key would be bad design, or lead to horrible problems. I can think of lots of cases where badly using natural keys would.

  • (cs) in reply to BradC
    BradC:

    Sometimes using artificial keys makes programming slightly harder or slightly slower (another join to get displayable data), but I can't think of any situations where using an artificial key would be bad design, or lead to horrible problems.



    Generally I agree, but here are some counter-examples:

    Bad design:
    Consider a database table that keeps program parameters, the kind of parameters you might
    otherwise put into a .properties file or the Windows registry.
    create table params (  id number /*autoincrement*/, name varchar(40), val varchar(200) );
    In your program, you would of course retrieve these parameters by their name, not by their artificial ID. So in fact, "name" is the key and "id" is just a totally useless appendix.

    Horrible problems:
    Consider a database table that is not maintained within the system, but regularly deleted completely and re-imported from an external source. For example, imagine you run some sort of franchise and the pricelist comes every month from the headquarters. If you accept their key as the natural key, it's easy: truncate table, import file. (For the sake of the argument, let's assume nothing is ever deleted from the pricelist, and nothing ever goes wrong ;-) The only premise is that you may not declare referential constraints in the database (not a big problem).
    On the other hand, if you introduce artificial keys, and reference this table by these keys, it's much more difficult, since you cannot simply throw away the old contents - after the next import, your artificial keys would not match existing data. You must match the old pricelist with the new file - much more complicated. (Well, maybe not horrible, but lots of work anyway - both in programming and runtime performance)
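    The truncate-and-reload case is easy to sketch in SQLite (invented schema; "WIDGET-1" etc. are made-up item codes): because the headquarters' item code is the key, a full reload leaves every externally held reference valid:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Pricelist keyed by the headquarters' own item code (a natural key).
cur.execute("CREATE TABLE pricelist (item_code TEXT PRIMARY KEY, price REAL)")

def monthly_import(rows):
    """The whole monthly refresh: delete everything, reload the file."""
    cur.execute("DELETE FROM pricelist")           # "truncate"
    cur.executemany("INSERT INTO pricelist VALUES (?, ?)", rows)

monthly_import([("WIDGET-1", 9.99), ("WIDGET-2", 19.99)])
# Next month's file arrives; item codes survive the reload unchanged.
monthly_import([("WIDGET-1", 10.49), ("WIDGET-2", 19.99)])

price = cur.execute(
    "SELECT price FROM pricelist WHERE item_code = 'WIDGET-1'").fetchone()[0]
```

    With an artificial key, the second import would have minted fresh ids, and anything referencing last month's ids would dangle.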
  • Runtime Error (unregistered) in reply to BradC
    BradC:

    RevMike:
    Generally speaking, the artificial versus natural key debate comes down to this:  You'll never do anything bad by using artificial keys.  There are some cases when you can do better using natural keys and lots of cases when you can do worse using natural keys.  If you're not 100% about using a natural key, go with artificial.

    Quoted for truthery.

    Sometimes using artificial keys makes programming slightly harder or slightly slower (another join to get displayable data), but I can't think of any situations that using an artificial key would be bad design, or lead to horrible problems. I can think of lots of cases that badly using natural keys would.



    The issue isn't really artificial key vs natural key.  It's that most people (and the discussions on here prove it) HAVE NO IDEA WHAT A GOOD NATURAL KEY LOOKS LIKE.

    Here are some things that are NOT good natural keys:
    Names
    Street Addresses
    Phone Numbers
    Email Addresses
    Social Security Numbers

    Natural keys have to be unique (that's always unique, not mostly unique), every row has to have one, and the value must never change for that row.  If it doesn't meet these criteria then it's not a good natural key.  Use an artificial one instead. 

    Generally the only good natural keys are account numbers or similar assigned values.  And even then using an artificial key wouldn't hurt.

    And for you people storing dates in integer values WHEN YOUR DATABASE ALREADY HAS A PERFECTLY FUNCTIONAL DATE TYPE, stop it.  Really, just stop it.  You are just hurting yourselves and everyone around you.  If you are using a database that doesn't have a date type then go download a real database that does have one.
  • (cs) in reply to Runtime Error

    gotta agree with you.  In fact, it's doubtful that any natural key is truly unique in real life. There's always the possibility of natural duplication. A simple incremented value seems the best key to me. 

  • mcguire (unregistered) in reply to KoFFiE
    KoFFiE:

    Anonymous:
    ... One of his MANY claims to shame is that he stores dates as intergers (int). Just try and get records from one date to another with that kind of design. ...

    Excuse me? Storing dates as integer numbers (preferably 64bit though) in seconds since 1 jan 1970 is imho one of the easiest ways to work with date/times. This is the way it works on any *nix system in the world.

    Of course, storing today as "20051222" is likely to be less easy.  Not to mention "22122005" or "12222005".  And don't forget the leading zeros.
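    The ordering problem with the other layouts is easy to see: of the integer-ish encodings, only YYYYMMDD compares chronologically, and DDMMYYYY additionally needs leading zeros (which an actual int type would drop):

```python
# DDMMYYYY: 1 Jan 2006 is later than 22 Dec 2005, but its encoding
# compares as smaller -- and as an int, "01012006" loses its leading zero.
later, earlier = "01012006", "22122005"
ddmmyyyy_sorts_correctly = later > earlier   # False: wrong order

# YYYYMMDD sorts correctly as a plain integer comparison.
yyyymmdd_sorts_correctly = 20060101 > 20051222   # True
```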

    KoFFiE:


    The real power of using dates as integer numbers is calculating and comparing, if you know basic math operators like +, -, modulo, and the facts that one minute is 60sec, one hour is 3600secs, and one day 86400secs. Not that hard I think? Comparing numbers is by far a lot easier for the developer than working with some complicated date/time library.



    I'd also note that not all minutes have 60 seconds.

  • Runtime Error (unregistered) in reply to olddog

    Very few data values make good natural keys.  And even if they make a good natural key now, they may not in the future.  Companies merge, account numbers that were unique aren't anymore, stuff like that.  Using natural keys buys almost nothing and can lead to pounding one's head on a wall later.

    One more important database design rule, kids, you might want to write this one down...

    Combining several crappy natural keys does NOT make a good natural key, it just makes a longer, less efficient crappy natural key.

  • (cs) in reply to Runtime Error

    good point runtime.  Natural keys should not be designed as a shortcut for searching.  Let the db do the work.

  • (cs) in reply to olddog
    olddog:
    good point runtime.  Natural keys should not be designed as a shortcut for searching.  Let the db do the work.


    Just to be clear...  Is it ok to use a composite key as a shortcut to search?
  • (cs) in reply to RevMike

    rev - I should retreat. If that works with your db - great.  Depends on what you expect to retrieve.  Whether you're expecting a single record or many records depends on the query (assuming some sort of SQL).  Seems to me, the whole point of a key is its uniqueness, composite or otherwise. If you're searching for a unique record, you'll need a unique key, however you define it.

  • Runtime Error (unregistered) in reply to RevMike
    RevMike:
    olddog:
    good point runtime.  Natural keys should not be designed as a shortcut for searching.  Let the db do the work.


    Just to be clear...  Is it ok to use a composite key as a shortcut to search?


    That depends on what you mean...

    Creating a composite primary key on a table using some natural key values that you hope are unique... BAD

    Creating a composite index on a table in order to speed up a search for a few columns that's done a lot.... GOOD.

    A database is not a hash table.  You can search a table on columns other than the primary key just as efficiently by creating an index. Sure, there is a tradeoff for creating another index, but that's nothing compared to the stygian hellspawn of a database schema you create by using composite natural keys hoping that the queries will go faster.  It really doesn't take long before you start making compromises like in the original project.  And then you end up on here.
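    A quick SQLite sketch of the "GOOD" case above (invented schema): an artificial primary key plus a composite index on the columns that are searched together, with the query plan confirming the index is actually used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Artificial primary key; the frequently-searched columns get a
# composite *index*, not a composite primary key.
cur.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    last_name TEXT, first_name TEXT, phone TEXT)""")
cur.execute("CREATE INDEX idx_person_name ON person (last_name, first_name)")

cur.execute("INSERT INTO person VALUES (1, 'Smith', 'John', '555-0100')")

# EXPLAIN QUERY PLAN reports which access path SQLite chose.
plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM person WHERE last_name = 'Smith' AND first_name = 'John'
""").fetchall()
uses_index = any("idx_person_name" in str(row) for row in plan)
```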
  • (cs) in reply to Runtime Error

    I couldn't agree more about the value of indexing.  The point I was making originally was that numeric keys (I didn't make this clear) are faster than char keys (as in the wtf).  And that as a PK they work well in an RDBMS, specifically during joins.  It's just been a goal of mine (in my databases) to reduce any foreign key to a number. It's proven to be fairly quick.

  • (cs) in reply to Runtime Error
    Anonymous:
    RevMike:
    olddog:
    good point runtime.  Natural keys should not be designed as a shortcut for searching.  Let the db do the work.


    Just to be clear...  Is it ok to use a composite key as a shortcut to search?


    That depends on what you mean...

    Creating a composite primary key on a table using some natural key values that you hope are unique... BAD

    Creating a composite index on a table in order to speed up a search for a few columns that's done a lot.... GOOD.

    A database is not a hash table.  You can search a table on columns other than the primary key just as efficiently by creating an index. Sure, there is a tradeoff for creating another index, but that's nothing compared to the stygian hellspawn of a database schema you create by using composite natural keys hoping that the queries will go faster.  It really doesn't take long before you start making compromises like in the original project.  And then you end up on here.




    You responded around my question, but did not respond to my question.  If you look back through this thread you'll see that I'm not ignorant of the issues inherent in using a natural key.  My actual statement was that "it is never bad to use an artificial key; you can sometimes do better with a natural key, and will frequently do worse."  Setting aside the artificial versus natural key discussion, do you consider it a good or bad practice to use a composite key?

    I have seen plenty of crippled database designs because data architects failed to properly understand composite keys, requiring useless and costly joins when reporting upon multilevel master/detail relationships.
  • (cs) in reply to olddog
    olddog:
    I couldn't agree more about the value of indexing.  The point I was making originally was that numeric keys (I didn't make this clear) are faster than char keys (as in the wtf).  And that as a PK they work well in an RDBMS, specifically during joins.  It's just been a goal of mine (in my databases) to reduce any foreign key to a number. It's proven to be fairly quick.


    Integer primary keys are important for Sybase/MSSql, but not important for Oracle or DB2.
  • (cs) in reply to RevMike

    rev - if you define a composite key as "two or more columns, designated together as a table's primary key," then I'd say yes, its uniqueness is "possible", but not a certainty. I suspect we are at odds about query response time - on which you are right.  A search, by nature, is usually not for a specific PK but for the contents of (one or some) columns within the record. Hence an indexed concatenation of columns would surely speed up the search.

  • Runtime Error (unregistered) in reply to RevMike
    RevMike:


    You responded around my question, but did not respond to my question.  If you look back through this thread you'll see that I'm not ignorant of the issues inherent in using a natural key.  My actual statement was that "it is never bad to use an artificial key; you can sometimes do better with a natural key, and will frequently do worse."  Setting aside the artificial versus natural key discussion, do you consider it a good or bad practice to use a composite key?

    I have seen plenty of crippled database designs because data architects failed to properly understand composite keys, requiring useless and costly joins when reporting upon multilevel master/detail relationships.


    Rev, sorry I misunderstood your question.  In my opinion using composite keys is wrong in most cases.  I'm sure there are cases where it's the right answer.  So the answer is: it depends on what's being modelled.

    I was specifically referring to the case where people take two bad natural keys (like name and phone number) and try to combine those into a primary key.  That's always a bad idea.  If neither key was a good natural key on its own then combining them isn't going to help.
  • Never developed a DB in my life (unregistered) in reply to olddog

    I am a web designer and PT  Cold Fusion developer  (which means I only SLIGHTLY understand most of this thread).

    The reason I am posting is to tell all you guys that you have given me a big 'eye-opener' as to what you folks go through.

     * Anonymous bows down to the DB developers. You guys frigging rock!

  • (cs) in reply to Runtime Error

    I reordered things a bit...

    Anonymous:

    I was specifically referring to the case where people take two bad natural keys (like name and phone number) and try to combine those into a primary key.  That's always a bad idea.  If neither key was a good natural key on its own then combining them isn't going to help.


    You are absolutely correct on this.  Bad natural keys can't be combined into a good primary key.

    Anonymous:

    In my opinion using composite keys is wrong in most cases.  I'm sure there are cases where it's the right answer.  So the answer is: it depends on what's being modeled.


    I have to strongly disagree with this.  In multilevel master detail relationships - a fairly common construct - a composite key is much better than using several single column keys.

    Take, for example, the structure of Customer-Order-OrderLineItem.  The customer is identified by some sort of CustomerID.  The CustomerID is 99.9% likely artificial, although there are a few small businesses where there may be a valid natural key.  The Order table has its own artificial key, as well as a foreign key relationship to Customer.  The OrderLineItem table has its own artificial key, as well as a foreign key relationship to Order.  This works perfectly fine for Operational Reporting.  We can account for every order made by a customer, and every line item that makes up each order.  We can print packing lists and shipping labels.  We can adjust inventory.  Everything works.

    Now, let's modify that model slightly.  We still have a Customer-Order-OrderLineItem structure.  However, now the primary key of Order is a composite of CustomerID and OrderID.  The primary key of OrderLineItem is a composite of CustomerID, OrderID, and LineItemNumber.  This still works perfectly fine for Operational Reporting.  We haven't limited ourselves in any way.

    The difference shows up when we want to start doing simple Business Intelligence reporting from this database.  Now we can also efficiently answer questions like "Who are the top 10 volume purchasers of doohickeys?"  The first model requires that we get to CustomerID through Order, and so we need to first join Order and OrderLineItem, then perform our aggregations against this result set in memory.  The second model doesn't require the join to Order at all.  The aggregation occurs against a single table.  In addition, the index will typically allow the aggregation to be streamed instead of being done entirely in memory.  We've drastically reduced the IO and reduced the required memory.  For a reasonably large system, we could easily be looking at an order of magnitude increase in speed.
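    The second model above can be sketched in a few lines of SQLite (table and data values are invented for illustration, and "Orders" is used instead of "Order", which is a reserved word).  Note how the top-buyers aggregation never touches the Orders table:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        CustomerName TEXT
    );
    CREATE TABLE Orders (
        CustomerID INTEGER REFERENCES Customer(CustomerID),
        OrderID INTEGER,
        PRIMARY KEY (CustomerID, OrderID)          -- composite key
    );
    CREATE TABLE OrderLineItem (
        CustomerID INTEGER,
        OrderID INTEGER,
        LineItemNumber INTEGER,
        Product TEXT,
        Quantity INTEGER,
        PRIMARY KEY (CustomerID, OrderID, LineItemNumber),
        FOREIGN KEY (CustomerID, OrderID) REFERENCES Orders(CustomerID, OrderID)
    );
    """)
    conn.execute("INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex')")
    conn.execute("INSERT INTO Orders VALUES (1, 1), (2, 1)")
    conn.executemany("INSERT INTO OrderLineItem VALUES (?,?,?,?,?)",
                     [(1, 1, 1, 'doohickey', 5),
                      (2, 1, 1, 'doohickey', 9),
                      (1, 1, 2, 'widget', 3)])

    # Aggregation by customer hits a single table -- no join to Orders needed.
    rows = conn.execute("""
        SELECT CustomerID, SUM(Quantity) AS total
        FROM OrderLineItem
        WHERE Product = 'doohickey'
        GROUP BY CustomerID
        ORDER BY total DESC
    """).fetchall()
    print(rows)  # -> [(2, 9), (1, 5)]
    ```
    
    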

  • (cs) in reply to Never developed a DB in my life
    Anonymous:

    I am a web designer and PT  Cold Fusion developer  (which means I only SLIGHTLY understand most of this thread).

    The reason I am posting is to tell all you guys that you have given me a big 'eye-opener' as to what you folks go through.

     * Anonymous bows down to the DB developers. You guys frigging rock!



    Good DB developers can rarely write decent application code, and decent application developers rarely are good at DB development.  It really is a completely separate discipline.

    After 7 years of developing for a really interesting database application (a financial system spread across 90+ geographically diverse nodes with selective replication, running on Oracle 6 on VAX/VMS), I elected to be trained in Java.  I took to it very well, but it took months before my object models stopped looking like normalized schemas.

    I now work for a software vendor that sells data integration products.  I need people who can write DB and application code.  I'll go through 50 "pre-qualified" resumes at a time before I get someone who can describe both the Java thread model and query tuning.
  • (cs) in reply to RevMike

    rev - I don't see the payoff for the construct.  If you needed "Who are the top 10 volume purchasers of doohickeys" ,  why not query SUM(OrderLineItem) WHERE doohickey=x  GROUP BY CustomerID

    Although... the slick part of the construct is that artificial keys are not likely to change, which gives extreme confidence to the composite key.

  • (cs) in reply to olddog
    olddog:

    rev - I don't see the payoff for the construct.  If you needed "Who are the top 10 volume purchasers of doohickeys" ,  why not query SUM(OrderLineItem) WHERE doohickey=x  GROUP BY CustomerID

    Although... the slick part of the construct is that artificial keys are not likely to change, which gives extreme confidence to the composite key.



    Because the plan for this:
    SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
    FROM Customer INNER JOIN Order
    ON (Customer.CustomerID = Order.CustomerID)
    INNER JOIN OrderLineItem
    ON (Order.OrderID = OrderLineItem.OrderID)
    WHERE OrderLineItem.Product = 'doohicky'

    is much more complex than the plan for this:
    SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
    FROM Customer INNER JOIN OrderLineItem
    ON (Customer.CustomerID = OrderLineItem.CustomerID)
    WHERE OrderLineItem.Product = 'doohicky'

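    The difference in plan complexity is easy to see with SQLite's EXPLAIN QUERY PLAN, where each output row is roughly one table-access step (schemas and names are invented for the sketch; "Orders" avoids the reserved word ORDER):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    CREATE TABLE OrderLineItem (
        LineItemID INTEGER PRIMARY KEY,
        OrderID INTEGER, CustomerID INTEGER,
        Product TEXT, Quantity INTEGER);
    """)

    # Plan for the three-table join (first model).
    three_way = conn.execute("""
        EXPLAIN QUERY PLAN
        SELECT c.CustomerName, SUM(l.Quantity)
        FROM Customer c
        JOIN Orders o ON c.CustomerID = o.CustomerID
        JOIN OrderLineItem l ON o.OrderID = l.OrderID
        WHERE l.Product = 'doohicky'
        GROUP BY c.CustomerName
    """).fetchall()

    # Plan for the two-table join (second model, CustomerID on the line item).
    two_way = conn.execute("""
        EXPLAIN QUERY PLAN
        SELECT c.CustomerName, SUM(l.Quantity)
        FROM Customer c
        JOIN OrderLineItem l ON c.CustomerID = l.CustomerID
        WHERE l.Product = 'doohicky'
        GROUP BY c.CustomerName
    """).fetchall()

    # The three-table join needs more plan steps than the two-table join.
    print(len(three_way), len(two_way))
    ```
    
    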

  • (cs) in reply to RevMike
    RevMike:


    Because the plan for this:
    SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
    FROM Customer INNER JOIN Order
    ON (Customer.CustomerID = Order.CustomerID)
    INNER JOIN OrderLineItem
    ON (Order.OrderID = OrderLineItem.OrderID)
    WHERE OrderLineItem.Product = 'doohicky'

    is much more complex than the plan for this:
    SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
    FROM Customer INNER JOIN OrderLineItem
    ON (Customer.CustomerID = OrderLineItem.CustomerID)
    WHERE OrderLineItem.Product = 'doohicky'




    Sorry, I forgot the "GROUP BY Customer.CustomerName" for both.
  • Tom K (unregistered) in reply to christoofar

    This is frowned on for two main reasons:
    • It shows that you are unaware of what the "facts" inside each table are and you probably aren't aware that perhaps a composite key would be the PK.
    • It allows duplicates (you are simply avoiding the key aspect and just using an identity column to number your rows... like a sheet of columnar paper... you can write JOHN SMITH   444 STATE STREET   POUGHKEEPSIE NY five times on that sheet of paper and each row is unique)
    Consider a table of people which is frequently queried on by phone number (say a PBX system queries it as calls come in).

    You could probably be safe with an IDENTITY column for numbering the customers, but you would want your PK to be the phone number PLUS the caller's name.  This identifies a fact that is UNIQUE to each row in the table, prevents duplicates, and it's not bad design.  If you believe in MSSQL, a good fill-factor and a CLUSTERED index, then it makes retrieving a unique record a sub-second operation.

    A nice thing with setting your PKs right is if you feed in data from bulk information services (market aggregators like Merkle and such).  You can feed in this data during the day (a program to throttle the inserts so OLTP functionality is not affected), and as constraints are dinged the exceptions can be fed into a dupe file or dupe table, which you can take back to your aggregator for a refund.

    What I see most often is some ID field at the beginning that's set IDENTITY CLUSTERED, then the rest of the table is heavy with non-clustered indexes as the table gets very big.  Sad, really.

    For extremely large people tables on MSSQL, I have set the primary key to the phone number plus the SOUNDEX of the name, and a separate index for the SOUNDEX of the address (separate columns are maintained to store the SOUNDEX values and are updated with triggers).  This makes an address/name search on a table with 255 million+ rows on it come back with results in seconds for LIKE searches.
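    For anyone unfamiliar with the SOUNDEX trick above: it is the classic phonetic code, so "Smith" and "Smyth" hash to the same value.  A minimal sketch of the common four-character American Soundex variant (the same scheme MSSQL's SOUNDEX() follows for ordinary ASCII names), so the stored columns could be computed outside the database too:

    ```python
    def soundex(name: str) -> str:
        # Letter-to-digit table of the classic Soundex algorithm.
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        name = "".join(ch for ch in name.upper() if ch.isalpha())
        if not name:
            return ""
        result = [name[0]]                 # first letter is kept verbatim
        prev = codes.get(name[0], "")
        for ch in name[1:]:
            if ch in "HW":
                continue                   # H and W don't break a run of equal codes
            code = codes.get(ch, "")       # vowels code to "" and act as separators
            if code and code != prev:
                result.append(code)
            prev = code
        return ("".join(result) + "000")[:4]   # pad/truncate to 4 characters

    print(soundex("Robert"), soundex("Rupert"))   # both R163
    print(soundex("Smith"), soundex("Smyth"))     # both S530
    ```
    
    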


    I beg to differ.  Artificial keys certainly have their drawbacks for tasks like data loads, but they are generally good design for a couple of reasons.

    1. Key Searches on integers are generally much faster than searches on long strings.
    2. There is no risk of  exposing private information in URLs or other types of query strings (this is why using a SSN or a phone number as a primary key is generally a very bad idea).
    Most databases do allow for defining additional unique constraints on other fields, so there is no need to rely on the primary key for this purpose.
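    The point above can be sketched in a few lines (SQLite here; table and column names are invented): keep a meaningless surrogate as the PK so nothing private ever appears in a URL, and let a separate UNIQUE constraint on the natural candidate key reject duplicates.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,   -- surrogate key, safe to expose
        Phone    TEXT NOT NULL,
        Name     TEXT NOT NULL,
        UNIQUE (Phone, Name)            -- duplicates still rejected here
    )""")

    conn.execute("INSERT INTO Person (Phone, Name) VALUES ('555-0100', 'John Smith')")
    try:
        # Second identical row violates the UNIQUE constraint, not the PK.
        conn.execute("INSERT INTO Person (Phone, Name) VALUES ('555-0100', 'John Smith')")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True

    print(duplicate_rejected)  # -> True
    ```
    
    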

  • (cs) in reply to RevMike

    revMike --

    Based on what you've offered, your tables might look ( in simple terms ) something like this:

    Customer
    {
     ID                    (int) //primary key
     CustomerName  (str) //human readable
     ...
    }
    Product
    {
     ID             (int) //primary key
     Product      (str) //human readable
     ...
    }
    Order
    {
     ID                (int) //primary key
     CustomerID  (int) //foreign key to Customer
     ...
    }
    OrderLineItem
    {
     ID                (int) //primary key
     OrderID        (int) //foreign key to Order
     CustomerID   (int) //foreign key to Customer-same as Order.CustomerID
     ProductID     (int) //foreign key to Product
     Quantity        (int)
     ...
    }

    The risk is that the application bears the responsibility to coordinate a change in Order.CustomerID ( i.e. a data entry error ) with OrderLineItem.CustomerID to keep them in sync. The purity of a single relationship would avoid this. However, I agree- YOU COULD SEE A HUGE BOOST in performance if you can safely manage the coordination. -- Do you trust your programmers?
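    Worth noting, though, that the sync burden need not fall on the application at all: a composite foreign key lets the database itself reject a line item whose (CustomerID, OrderID) pair does not match the parent order.  A sketch in SQLite (names invented; "Orders" dodges the reserved word ORDER):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
    conn.executescript("""
    CREATE TABLE Orders (
        CustomerID INTEGER,
        OrderID INTEGER,
        PRIMARY KEY (CustomerID, OrderID)
    );
    CREATE TABLE OrderLineItem (
        CustomerID INTEGER,
        OrderID INTEGER,
        LineItemNumber INTEGER,
        PRIMARY KEY (CustomerID, OrderID, LineItemNumber),
        -- the composite FK is what keeps CustomerID consistent with the order
        FOREIGN KEY (CustomerID, OrderID) REFERENCES Orders (CustomerID, OrderID)
    );
    """)
    conn.execute("INSERT INTO Orders VALUES (1, 100)")
    conn.execute("INSERT INTO OrderLineItem VALUES (1, 100, 1)")      # consistent: OK
    try:
        conn.execute("INSERT INTO OrderLineItem VALUES (2, 100, 1)")  # wrong customer
        mismatch_rejected = False
    except sqlite3.IntegrityError:
        mismatch_rejected = True

    print(mismatch_rejected)  # -> True
    ```
    
    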

  • (cs) in reply to olddog
    olddog:

    revMike --

    Based on what you've offered, your tables might look ( in simple terms ) something like this:

    Customer
    {
     ID                    (int) //primary key
     CustomerName  (str) //human readable
     ...
    }
    Product
    {
     ID             (int) //primary key
     Product      (str) //human readable
     ...
    }
    Order
    {
     ID                (int) //primary key
     CustomerID  (int) //foreign key to Customer
     ...
    }
    OrderLineItem
    {
     ID                (int) //primary key
     OrderID        (int) //foreign key to Order
     CustomerID   (int) //foreign key to Customer-same as Order.CustomerID
     ProductID     (int) //foreign key to Product
     Quantity        (int)
     ...
    }

    Minor nit to pick:  I'm using composite keys.  The primary key of Order is OrderID AND CustomerID.  The primary key of OrderLineItem is OrderID, CustomerID, and LineItemNumber.


    The risk is that the application bears the responsibility to coordinate a change in Order.CustomerID ( i.e. a data entry error ) with OrderLineItem.CustomerID to keep them in sync. The purity of a single relationship would avoid this. However, I agree- YOU COULD SEE A HUGE BOOST in performance if you can safely manage the coordination. -- Do you trust your programmers?



    So your argument against this architecture is that I need to have a multistep transaction in order to change the association of order and customer?  First of all, I hope to God that I have someone on my staff who understands what happens when one turns off autocommit.  Frankly, this is a trivial problem that anyone who calls themselves a database programmer should be able to handle.  But more to the point, I would rarely if ever code such a transaction into the system in the first place.  The proper way to do such a transaction is to cancel the order and reenter it under the appropriate customer.  (Note that I didn't say re-key - I'll provide an automated way to do it.)  It is always better to have an audit trail on these things, and changing history is not the way to have a trail.
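    Sketched with SQLite and an invented schema: with autocommit off, "cancel and reenter" is a single atomic transaction, and the original order survives as audit trail instead of being rewritten.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE Orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        Status TEXT DEFAULT 'OPEN')""")
    # Order 1 was entered under the wrong customer (7 instead of 8).
    conn.execute("INSERT INTO Orders (OrderID, CustomerID) VALUES (1, 7)")

    # One transaction: cancel the old order, reenter under the right customer.
    # The context manager commits on success and rolls back on any exception.
    with conn:
        conn.execute("UPDATE Orders SET Status = 'CANCELLED' WHERE OrderID = 1")
        conn.execute("INSERT INTO Orders (OrderID, CustomerID) VALUES (2, 8)")

    rows = conn.execute(
        "SELECT OrderID, CustomerID, Status FROM Orders ORDER BY OrderID").fetchall()
    print(rows)  # -> [(1, 7, 'CANCELLED'), (2, 8, 'OPEN')]
    ```
    
    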
  • (cs)

    I only want to know one thing. I got this strange bill in the mail so who should I send the payment to?

  • (cs) in reply to RevMike

    revMike - Sincere thanks for the clarity.  When you point out the obvious risk of multi-step transactions it becomes apparent that simpler doesn't always translate to safer.  Also obvious, is my lack of experience with advanced db systems that support composite keys, auto-commit and atomic transactions.  I hope you don't mind my pushing your button on this thread. I'm sure it was to my benefit. Plus the point of an audit trail is plain-ole common-sense. At some point, you'll have to prove the change.... what better than a record of the initial transaction.

  • (cs) in reply to olddog
    olddog:
    revMike - Sincere thanks for the clarity.  When you point out the obvious *risk* of multi-step transactions it becomes apparent that simpler doesn't always translate to safer.  Also obvious, is my lack of experience with advanced db systems that support composite keys, auto-commit and atomic transactions.


    Fortunately, advanced features are becoming more widely available.  DB2 and Oracle are still, in my opinion, the leaders in these areas - though many will correctly point to gaps in their usability.  SqlServer has improved dramatically and deserves recognition, and MySQL has fixed many of their shortcomings and deserves a wider install base among higher end applications.

    I hope you don't mind my pushing your button on this thread. I'm sure it was to my benefit.


    I love to argue. :) Besides, what better way for everyone to learn if not by constantly challenging and being challenged.

    Plus the point of an audit trail is plain-ole common-sense. At some point, you'll have to prove the change.... what better than a record of the initial transaction.


    I cut my teeth working on a financial system where the rule was that a data element was never updated.  The only time we ever ran an update statement was to add data to a field that was intentionally left null by an earlier insert. It was very nice to be able to track the movement of every dollar through that system.  When a user would call and say "You broke the system! Where did my charges go?"  we could say "On date XX/YY/ZZ user UUUUU transferred those charges to account NNNNNN-NNNN-NN."  Ultimately, a good audit trail protects the developers from bug accusations.
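    The insert-only discipline described above can be sketched as an append-only ledger (SQLite; table, column, and account names are invented placeholders): money moves by appending offsetting rows, never by updating history.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE Ledger (
        EntryID INTEGER PRIMARY KEY,
        Account TEXT, Amount NUMERIC, Memo TEXT)""")

    def transfer(src, dst, amount, memo):
        # Two inserts in one transaction -- history is only ever appended,
        # so the trail always shows who moved what, and where.
        with conn:
            conn.execute("INSERT INTO Ledger (Account, Amount, Memo) VALUES (?,?,?)",
                         (src, -amount, memo))
            conn.execute("INSERT INTO Ledger (Account, Amount, Memo) VALUES (?,?,?)",
                         (dst, amount, memo))

    transfer("ACCT-01", "ACCT-02", 100, "user UUUUU transfer")

    # Balances are derived by aggregation, never stored and overwritten.
    balances = dict(conn.execute(
        "SELECT Account, SUM(Amount) FROM Ledger GROUP BY Account").fetchall())
    print(balances)  # -> {'ACCT-01': -100, 'ACCT-02': 100}
    ```
    
    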
  • (cs) in reply to RevMike
    RevMike:

    I have to strongly disagree with this.  In multilevel master detail relationships - a fairly common construct - a composite key is much better than using several single column keys.


    I beg to differ. Especially in a multilevel master detail relationship it's better to keep the references (primary/foreign keys) simple. I'll try to work through it with your enhanced example.

    RevMike:

    Take, for example, the structure of Customer-Order-OrderLineItem.  The customer is identified by some sort of CustomerID.  The CustomerID is 99.9% likely artificial, although there are a few small businesses where there may be a valid natural key.  The Order table has its own artificial key, as well as a foreign key relationship to Customer.  The OrderLineItem table has its own artificial key, as well as a foreign key relationship to Order.  This works perfectly fine for Operational Reporting.  We can account for every order made by a customer, and every line item that makes up each order.  We can print packing lists and shipping labels.  We can adjust inventory.  Everything works.



    So far so good.

    RevMike:

    Now, lets modify that model slightly.  We still have a Customer-Order-OrderLineItem structure.  However, now the primary key of Order is a composite of CustomerID and OrderID.  The primary key of OrderLineItem is a composite of CustomerID, OrderID, and LineItemNumber.  This still works perfectly fine for Operational Reporting.  We haven't limited ourself in any way.



    OrderID itself is not unique anymore. If that's a requirement, your approach neglects it. Every reference to an order now has to include the customer ID. Sure, you could add a unique constraint on OrderID.

    If an order has been misplaced, you have to run update cascades through all attached tables. Chances of causing locking problems and generating I/O just moved up a bit.

    Same goes for LineItem, although people sometimes try to provide a gap-free line item number starting at 1. Needless to say, that approach can cause serialization problems. Any reference to a line item will now take at least 3 (+1) columns. Aside from update/locking problems, it's a waste of space. DBMSs, at least the ones I know of, are "born" for joining; otherwise we could have stayed with hierarchical databases.

    RevMike:

    The difference shows up when we want to start doing simple Business Intelligence reporting from this database. 



    That's why they invented views. If you want to "denormalize" the data and keep everything in one "table", write a view.
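    The view alternative, sketched with SQLite and invented names ("Orders" avoids the reserved word ORDER): the base tables stay normalized with simple keys, and reporting code queries the view as if it were one wide table.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    CREATE TABLE OrderLineItem (
        LineItemID INTEGER PRIMARY KEY,
        OrderID INTEGER, Product TEXT, Quantity INTEGER);

    -- The view hides the joins; base tables keep their simple keys.
    CREATE VIEW CustomerPurchases AS
        SELECT c.CustomerID, c.CustomerName, l.Product, l.Quantity
        FROM Customer c
        JOIN Orders o ON o.CustomerID = c.CustomerID
        JOIN OrderLineItem l ON l.OrderID = o.OrderID;
    """)
    conn.execute("INSERT INTO Customer VALUES (1, 'Acme')")
    conn.execute("INSERT INTO Orders VALUES (10, 1)")
    conn.execute("INSERT INTO OrderLineItem VALUES (1, 10, 'doohickey', 5)")

    rows = conn.execute(
        "SELECT CustomerName, SUM(Quantity) FROM CustomerPurchases "
        "WHERE Product = 'doohickey' GROUP BY CustomerName").fetchall()
    print(rows)  # -> [('Acme', 5)]
    ```
    
    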

    RevMike:

    Now we can also efficiently answer questions like "Who are the top 10 volume purchasers of doohickeys?"  The first model requires that we get to CustomerID through Order, and so we need to first join Order and OrderLineItem, then perform our aggregations against this result set in memory.



    That's what databases are for, so I heard. Problem is: your model is now optimized for one certain type of question. But if I want to create a report including order date and customer name, I still have to do the joins with your approach, still carrying the extra weight of your composite keys; no savings there. Maybe your customers want their data model optimized for one certain type of question. Mine keep coming up with new ideas every other day.


    RevMike:

    The second model doesn't require the join to Order at all.  The aggregation occurs against a single table.  In addition, the index will typically allow the aggregation to be streamed instead done entirely within memory.  We've drastically reduced the IO and reduced the required memory.  For a reasonably large system, we could easily be looking at an order of magnitude increase in speed.



    Well, if we're talking about tuning, there are other options, such as materialized views, which, as anyone would have guessed, stem from data warehouse environments; same goes for analytic functions, etc.


    A three-level nesting may be simple, but when you've seen data models with 8+ fields of composite keys, you wish you'd never seen them in the first place.

    Natural keys: I only use them in lookup tables, because there the chances are really low that you have a) additional attached tables and b) updates on the lookup values themselves.

    And yes, I work with large databases.

    l.

  • (cs) in reply to lofwyr

    I don't have a problem with designing a database optimized to solve a specific need. In fact, that's the ultimate goal.  However, in my experience a database is designed based on the client interview, which is most often short-sighted...  At least that's what I normally get outta the interview. I tend to walk away feeling that the client is desperate, and is looking for a disposable database to solve their immediate need for a specific report.   During the interview, when I bring up the possible ( future ) need for alternate functionality, the usual response from the client is "We don't need that now - let's just build what we need". So.. my response is to build a database that will be agile enough to react ( remodel ) to the subsequent changes that are inevitably requested.  The tuning and optimization comes much later ( after the big picture is revealed ). So... for me, a simple design is the best solution outta the gate.

Leave a comment on “Uniquely Addressing”
