Admin
<font size="1" style="font-family: verdana;">
<font size="2">This to me seems backwards. Isn't it more likely that the average person would forget the directional rather than use the wrong suffix? But what do I know, I can't comprehend the collective mind of the USPS.
Whichever bureau/department/what-have-you is in charge of street naming really should try harder to avoid this kind of duplication. I recently went looking for the town hall in the town I just moved to. The address was something like 225 Main St., or so I thought. So I drive back and forth on Main St. a couple of times, from the end, and only see 100-level numbers and lower, which are decreasing. Turns out the actual address was 225 E. Main St., which begins where Main St. "ends" (logically only; it's the same physical street), with the numbers increasing. Now, E and W I can sort of understand, if there's an obvious boundary like a major road or train tracks or whatever. But E and nothing, where it switches at some random intersection? WTF?
Admin
I agree with what you're saying for the most part, but again I'd point out that the same thing can be accomplished using an artificial key. What eliminates the joins is that you've made a second copy of some of the data from the parent table in each row of the child tables; effectively, you've denormalized the schema somewhat. Performing that denormalization is a separable issue from how you define your primary key. Nothing prevents you from denormalizing for performance just because you use an artificial key; it just happens to be a side effect of using a natural key.
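For illustration, the same denormalization with an artificial key might look like this (table and column names are made up):

CREATE TABLE Customer (
  CustomerID integer PRIMARY KEY, -- artificial key
  Region varchar(20)
);
CREATE TABLE CustomerOrder (
  OrderID integer PRIMARY KEY, -- artificial key
  CustomerID integer REFERENCES Customer (CustomerID),
  CustomerRegion varchar(20) -- second copy of the parent's Region; this copy,
                             -- not the key choice, is what eliminates the join
);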
Admin
@cconroy: We actually have a similar thing in the UK. There's a Chadwick Street and a Chadwick Street South, though from the road layout in that case it's obvious they're different.
Personally, though, in the example you gave, I reckon that Main Street, East Main Street, and West Main Street should all be the same street. The directional should be there only to give you a clue which part of the street you're looking for, since in the US streets can get very long. So if they say "123 E Main St" then you know that it's going to be on Main Street, somewhere towards the east end of the street.
Admin
Ooh. That's fun.
Admin
I think the entire argument boils down to this: there are very few panaceas in database design. There are only models, and it's up to your planning (and some good generalities about the laws of physics) to implement the best model that fits the application at hand with the database engine you bought (or downloaded). The trick is in properly justifying your design decisions and testing your theories BEFORE you apply them (and are stuck with them).
It's very hard to fix a crufted database after it goes live and starts to see heavy production use. I've done DB transitions, and those projects are long, hard, and complicated.
Admin
Are we looking at a new WTFism?
Admin
Try to get records from one date to another.... okay. I'll give it a shot. How about all the records in October of this year?
SELECT * FROM appropriate_table WHERE date >= 20051001 AND date < 20061101 ORDER BY date;
Do I win?
Admin
Phone number and name is a good PK?
I guess you're not familiar with John Smith and his son John Smith, who both live at the same address with one phone number? <sarcasm>No, nobody would ever name their kid after themselves.</sarcasm>
Admin
Not to mention leap years and the odd leap second. For anyone who's concerned about keeping their clock sync'd with the Earth's rotation, timekeeping isn't quite so cut-and-dried. Granted, for almost all business cases, leap seconds can safely be ignored. About the only people who really care are astronomers.
Admin
And I've had mail delivered to me by the USPS for *way* more than two years and I can assure you that they only care about court, street, avenue, etc. when there's a reason to. So there!
Honestly, these types of "corrections for corrections' sake" posts sound like kids squabbling. Arg.
James' point is that using a user-typed string as a primary key is troublesome - at my last job they had a table that was keyed on "first name + last name" and in it there were plenty of duplicate entries based on "John O'hara" = "John O'Hara" = "J Ohara" = "Jhon O'Hara" etc. So much easier to have a lookup table with standardized names keyed by an id ...
IMEO, of course.
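A minimal sketch of that lookup-table approach (names made up):

CREATE TABLE Person (
  PersonID integer PRIMARY KEY, -- artificial key
  FullName varchar(80) NOT NULL -- the standardized spelling lives in exactly one row
);
-- Other tables store PersonID, so correcting "Jhon O'Hara" is a one-row
-- data fix instead of a key change rippling through the schema.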
Admin
I had a similar but more insidious bug not too long ago. The repro steps were just to open a file on a web server; the request redirected to the actual file on another server. Fair enough. It was failing approximately 4/5 times with a nonspecific download error. Turns out that the web server (an internal one we stock with a variety of sample documents) was doing some database-backed auditing with a script that used a time offset in seconds as (part of?) the primary key (I presume it was something like (originating address or user, page requested, time)). Two requests in the same second caused an internal server error on the second: ODBC error, duplicate primary key.
Tracking it down was a lot of fun (not really). Single-stepping in the debugger always succeeded, because it would be more than a second between requests (there were multiple requests made because of an IE library beyond our control); similarly it sometimes worked because the two requests would cross a second boundary. I ended up putting a lot of debug output statements in the code to narrow it down, watching the network traffic with NetMon, and then testing my hypothesis with a perl (LWP++) script and then in the browser. Clearly this wasn't our fault, and was resolved that way.
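A hypothetical reconstruction of that audit table and its failure mode (I'm only presuming the columns, as noted above):

CREATE TABLE RequestAudit (
  ClientAddr varchar(15),
  Page varchar(255),
  RequestTime integer, -- time offset in seconds
  PRIMARY KEY (ClientAddr, Page, RequestTime)
);
INSERT INTO RequestAudit VALUES ('10.0.0.5', '/samples/report.doc', 1133222555);
INSERT INTO RequestAudit VALUES ('10.0.0.5', '/samples/report.doc', 1133222555);
-- The second insert, landing within the same second, fails with a
-- duplicate-key error: the intermittent 4/5 failure described above.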
Admin
ha ha ha ha. No. Try your dates again.
Admin
> i feel i need to definitely study more about databases. is there a good book?
Data Modeling Essentials, 2nd edition by Graeme Simsion.
Admin
If you don't mind me asking, what's wrong with storing dates as integers? A common format I use is YYYYMMDD - dates sort nicely, it's easy to constrain this format, and getting a date range is a simple greater/less than question.
E.g., for all of 2005: ...WHERE date >= 20050101 AND date <= 20051231
Or, if you use unix epoch, just divide by 86400 to get a day number and compare those to find everything within a date range.
The only negative is that you can't get +- NN days this way - but it's not hard to precalculate that before you submit the SQL statement.
So why would this be a WTF again?
PS: I manage finance-related data in the millions with my DB apps - and so far, these methods for keeping dates have always been spot on...
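For what it's worth, a sketch of the constraints mentioned above (MOD/FLOOR syntax varies by engine):

CREATE TABLE FinanceTxn (
  TxnID integer PRIMARY KEY,
  TxnDate integer NOT NULL, -- YYYYMMDD
  CHECK (MOD(TxnDate, 100) BETWEEN 1 AND 31), -- day part
  CHECK (MOD(FLOOR(TxnDate / 100), 100) BETWEEN 1 AND 12), -- month part
  CHECK (TxnDate BETWEEN 19000101 AND 99991231) -- plausible year range
);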
Admin
Toronto, ON has an Avenue Road.
Vancouver, BC has Kent Avenue North and Kent Avenue South on either side of the rail line. East and west of the meridian have to be indicated as well, so there are:
E. Kent Ave. N.
W. Kent Ave. N.
E. Kent Ave. S.
W. Kent Ave. S.
This is right by the South Foot of Main Street that I already mentioned.
Sincerely,
Gene Wirchenko
Admin
Ba! Ha. Hee.
Oops.
[:$]
(Originally, I was going to make the example December of this year, but then, for some asinine reason, I went, "No - December's not done yet. Do a month which is over." And so I changed the 20060101 to 20061101, instead of 20051101. Which is all by way of saying, I am a big dork.)
Admin
I've seen this quite often. Finance-related data plus this particular idiom screams Sybase at me. I've been told that some older Sybase versions did a particularly bad job of indexing and searching on date columns. This was a reasonable workaround for a flaw in the RDBMS.
I wouldn't suggest designing a system like this now.
Admin
My wife and I have different last names but the same phone number. If we (independently) purchased something from this business, they'd still need to call the dumba$$ developer to "uniquify" his idiotic key.
Admin
<font size="1" style="font-family: verdana;">
<font size="2">I agree. In my case, the entire street (well, the "Main" section of it -- it actually changes names a couple more times) is maybe two miles long. Probably overkill to "divide" it. But if you're going to call one half E, at least name the other section W!
This is minor compared to the numbering, though. In case I wasn't clear, the numbers start at 160 or so on one end, go down to zero, then increase again when it switches from E. Main to Main. So it's not the case for the 225 address I was looking for, but it is entirely possible that there are two 123s, one on Main and one on E. Main. This can get confusing really fast, especially if you're looking for a non-obvious building like I was (a house or small shop, say).
Admin
<font size="2">...er, "not looking for an obvious building like I was". Got my thoughts scrambled there.
Admin
Quoted for truthery.
Sometimes using artificial keys makes programming slightly harder or slightly slower (another join to get displayable data), but I can't think of any situation in which using an artificial key would be bad design or lead to horrible problems. I can think of lots of cases where badly using natural keys would.
Admin
Generally I agree, but here are some counter-examples:
Bad design:
Consider a database table that keeps program parameters, the kind of parameters you might otherwise put into a .properties file or the Windows registry.
create table params ( id number /*autoincrement*/, name varchar(40), val varchar(200) );
In your program, you would of course retrieve these parameters by their name, not by their artificial ID. So in fact, "name" is the key and "id" is just a totally useless appendix.
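A one-line illustration (the parameter name is made up):

SELECT val FROM params WHERE name = 'smtp.host'; -- the id column never appears in application code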
Horrible problems:
Consider a database table that is not maintained within the system, but is regularly deleted and completely re-imported from an external source. For example, imagine you run some sort of franchise and the pricelist comes every month from the headquarters. If you accept their key as the natural key, it's easy: truncate table, import file. (For the sake of the argument, let's assume nothing is ever deleted from the pricelist, and nothing ever goes wrong ;-) The only premise is that you may not declare referential constraints in the database (not a big problem).
On the other hand, if you introduce artificial keys, and reference this table by those keys, it's much more difficult, since you cannot simply throw away the old contents - after the next import, your artificial keys would no longer match existing data. You must match the old pricelist against the new file - much more complicated. (Well, maybe not horrible, but lots of work anyway - both in programming and in runtime performance.)
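Sketch of the easy path with the natural key (assuming, as above, no referential constraints are declared):

TRUNCATE TABLE pricelist;
-- ...then bulk-load the monthly file from headquarters. Its item numbers
-- ARE the keys, so rows elsewhere that store those item numbers still
-- match, with no re-keying or matching step.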
Admin
The issue isn't really artificial key vs. natural key. It's that most people (and the discussion on here proves it) HAVE NO IDEA WHAT A GOOD NATURAL KEY LOOKS LIKE.
Here are some things that are NOT good natural keys:
Names
Street Addresses
Phone Numbers
Email Addresses
Social Security Numbers
Natural keys have to be unique (that's always unique, not mostly unique), every row has to have one, and the value must never change for that row. If it doesn't meet these criteria then it's not a good natural key. Use an artificial one instead.
Generally the only good natural keys are account numbers or similar assigned values. And even then using an artificial key wouldn't hurt.
And for you people storing dates in integer values WHEN YOUR DATABASE ALREADY HAS A PERFECTLY FUNCTIONAL DATE TYPE, stop it. Really, just stop it. You are just hurting yourselves and everyone around you. If you are using a database that doesn't have a date type then go download a real database that does have one.
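For the October example from earlier, with a native date type it's simply (date-literal syntax varies slightly by engine):

SELECT * FROM appropriate_table
WHERE order_date >= DATE '2005-10-01'
  AND order_date < DATE '2005-11-01' -- half-open range: no end-of-month arithmetic, no wrong-year typos
ORDER BY order_date;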
Admin
Gotta agree with you. In fact, it's doubtful that any natural key is truly unique in real life. There's always the possibility of natural duplication. A simple incremented value seems the best key to me.
Admin
I'd also note that not all minutes have 60 seconds.
Admin
Very few data values make good natural keys. And even if they make a good natural key now, they may not in the future. Companies merge, account numbers that were unique aren't anymore, stuff like that. Using natural keys buys almost nothing and can lead to pounding one's head on a wall later.
One more important database design rule; kids, you might want to write this one down...
Combining several crappy natural keys does NOT make a good natural key, it just makes a longer, less efficient crappy natural key.
Admin
Good point, runtime. Natural keys should not be designed as a shortcut for searching. Let the db do the work.
Admin
Just to be clear... Is it ok to use a composite key as a shortcut to search?
Admin
rev - I should retreat. If that works with your db - great. Depends on what you expect to retrieve. Whether you're expecting a single record or many records depends on the query ( assuming some sort of SQL ). Seems to me, the whole point of a key is its uniqueness, composite or otherwise. If you're searching for a unique record, you'll need a unique key, however you define it.
Admin
That depends on what you mean...
Creating a composite primary key on a table using some natural key values that you hope are unique... BAD
Creating a composite index on a table in order to speed up a search for a few columns that's done a lot... GOOD.
A database is not a hash table. You can search a table on columns other than the primary key just as quickly by creating an index. Sure, there is a tradeoff for creating another index, but that's nothing compared to the stygian hellspawn of a database schema you create by using composite natural keys hoping that the queries will go faster. It really doesn't take long before you start making compromises like in the original project. And then you end up on here.
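For instance (names made up), the fast search and the simple key can coexist:

CREATE TABLE Contact (
  ContactID integer PRIMARY KEY, -- simple artificial key
  LastName varchar(40),
  Phone varchar(20)
);
CREATE INDEX ix_contact_name_phone
  ON Contact (LastName, Phone); -- composite INDEX for the hot search; no uniqueness required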
Admin
I couldn't agree more about the value of indexing. The point I was making originally was that numeric keys ( I didn't make this clear ) are faster than char keys ( as in the wtf ). And that as a PK they work well in an RDBMS, specifically during joins. It's just been a goal of mine ( in my databases ) to reduce any foreign key to a number. It's proven to be fairly quick.
Admin
You responded around my question, but did not respond to my question. If you look back through this thread you'll see that I'm not ignorant of the issues inherent in using a natural key. My actual statement was that "it is never bad to use an artificial key; you can sometimes do better with a natural key, and will frequently do worse." Setting aside the artificial versus natural key discussion, do you consider it a good or bad practice to use a composite key?
I have seen plenty of crippled database designs because data architects failed to properly understand composite keys, requiring useless and costly joins when reporting upon multilevel master/detail relationships.
Admin
Integer primary keys are important for Sybase/MSSql, but not important for Oracle or DB2.
Admin
rev - if you define a composite key as "two or more columns, designated together as a table's primary key," then I'd say yes, its uniqueness is "possible", but not a certainty. I suspect we are at odds about query response time, about which you are right. A search, by nature, is usually not for a specific PK but for the contents of ( one or some ) columns within the record. Hence - an indexed ( concatenated ) set of columns would surely speed up the search.
Admin
Rev, sorry I misunderstood your question. In my opinion, using composite keys is wrong in most cases. I'm sure there are cases where it's the right answer. So the answer is: it depends on what's being modelled.
I was specifically referring to the case where people take two bad natural keys (like name and phone number) and try to combine those into a primary key. That's always a bad idea. If neither key was a good natural key on its own then combining them isn't going to help.
Admin
I am a web designer and PT Cold Fusion developer (which means I only SLIGHTLY understand most of this thread).
The reason I am posting is to tell all you guys that you have given me a big "eye-opener" as to what you folks go through.
* Anonymous bows down to the DB developers. You guys frigging rock!
Admin
I reordered things a bit...
You are absolutely correct on this. Bad natural keys can't be combined into a good primary key.
I have to strongly disagree with this. In multilevel master/detail relationships - a fairly common construct - a composite key is much better than using several single-column keys.
Take, for example, the structure of Customer-Order-OrderLineItem. The customer is identified by some sort of CustomerID. The CustomerID is 99.9% likely artificial, although there are a few small businesses where there may be a valid natural key. The Order table has its own artificial key, as well as a foreign key relationship to Customer. The OrderLineItem table has its own artificial key, as well as a foreign key relationship to Order. This works perfectly fine for Operational Reporting. We can account for every order made by a customer, and every line item that makes up each order. We can print packing lists and shipping labels. We can adjust inventory. Everything works.
Now, let's modify that model slightly. We still have a Customer-Order-OrderLineItem structure. However, now the primary key of Order is a composite of CustomerID and OrderID. The primary key of OrderLineItem is a composite of CustomerID, OrderID, and LineItemNumber. This still works perfectly fine for Operational Reporting. We haven't limited ourselves in any way.
The difference shows up when we want to start doing simple Business Intelligence reporting from this database. Now we can also efficiently answer questions like "Who are the top 10 volume purchasers of doohickeys?" The first model requires that we get to CustomerID through Order, and so we need to first join Order and OrderLineItem, then perform our aggregations against this result set in memory. The second model doesn't require the join to Order at all. The aggregation occurs against a single table. In addition, the index will typically allow the aggregation to be streamed instead of done entirely within memory. We've drastically reduced the I/O and reduced the required memory. For a reasonably large system, we could easily be looking at an order-of-magnitude increase in speed.
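In DDL, the second model might look like this (a sketch; "Order" is renamed CustomerOrder since ORDER is a reserved word):

CREATE TABLE Customer (
  CustomerID integer PRIMARY KEY,
  CustomerName varchar(80)
);
CREATE TABLE CustomerOrder (
  CustomerID integer NOT NULL REFERENCES Customer (CustomerID),
  OrderID integer NOT NULL,
  PRIMARY KEY (CustomerID, OrderID)
);
CREATE TABLE OrderLineItem (
  CustomerID integer NOT NULL,
  OrderID integer NOT NULL,
  LineItemNumber integer NOT NULL,
  Product varchar(40),
  Quantity integer,
  PRIMARY KEY (CustomerID, OrderID, LineItemNumber),
  FOREIGN KEY (CustomerID, OrderID) REFERENCES CustomerOrder (CustomerID, OrderID)
);
-- The doohickey aggregation now touches only OrderLineItem (plus Customer
-- for the name); no join to CustomerOrder is needed.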
Admin
Good DB developers can rarely write decent application code, and decent application developers rarely are good at DB development. It really is a completely separate discipline.
After 7 years of developing for a really interesting database application (a financial system spread across 90+ geographically diverse nodes with selective replication, running on Oracle 6 on VAX/VMS), I opted to be trained in Java. I took to it very well, but it took months before my object models stopped looking like normalized schemas.
I now work for a software vendor that sells data integration products. I need people who can write DB and application code. I'll go through 50 "pre-qualified" resumes at a time before I get someone who can describe both the Java thread model and query tuning.
Admin
rev - I don't see the payoff for the construct. If you needed "Who are the top 10 volume purchasers of doohickeys?", why not query SUM(OrderLineItem) WHERE doohickey=x GROUP BY CustomerID?
Although... the slick part of the construct is that artificial keys are not likely to change, which gives extreme confidence in the composite key.
Admin
Because the plan for this:
SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
FROM Customer INNER JOIN Order
ON (Customer.CustomerID = Order.CustomerID)
INNER JOIN OrderLineItem
ON (Order.OrderID = OrderLineItem.OrderID)
WHERE OrderLineItem.Product = 'doohickey'
is much more complex than the plan for this:
SELECT Customer.CustomerName, SUM(OrderLineItem.Quantity)
FROM Customer INNER JOIN OrderLineItem
ON (Customer.CustomerID = OrderLineItem.CustomerID)
WHERE OrderLineItem.Product = 'doohickey'
Admin
Sorry, I forgot the "GROUP BY Customer.CustomerName" for both.
Admin
I beg to differ. Artificial keys certainly have their drawbacks for tasks like data loads, but they are generally good design for a couple of reasons.
- Key searches on integers are generally much faster than searches on long strings.
- There is no risk of exposing private information in URLs or other types of query strings (this is why using an SSN or a phone number as a primary key is generally a very bad idea).
Most databases do allow for defining additional unique constraints on other fields, so there is no need to rely on the primary key for this purpose.
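A minimal sketch of that pattern (names made up):

CREATE TABLE Subscriber (
  SubscriberID integer PRIMARY KEY, -- artificial key, safe to expose in a URL
  SSN char(9) NOT NULL UNIQUE, -- private natural candidate, still protected by a unique constraint
  Phone varchar(20)
);
Admin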
revMike --
Based on what you've offered, your tables might look ( in simple terms ) something like this:
Customer
{
ID (int) //primary key
CustomerName (str) //human readable
...
}
Product
{
ID (int) //primary key
Product (str) //human readable
...
}
Order
{
ID (int) //primary key
CustomerID (int) //foreign key to Customer
...
}
OrderLineItem
{
ID (int) //primary key
OrderID (int) //foreign key to Order
CustomerID (int) //foreign key to Customer-same as Order.CustomerID
ProductID (int) //foreign key to Product
Quantity (int)
...
}
The risk is that the application bears the responsibility for coordinating a change in Order.CustomerID ( ie. a data entry correction ) with OrderLineItem.CustomerID to keep them in sync. The purity of a single relationship would avoid this. However, I agree - YOU COULD SEE A HUGE BOOST in performance if you can safely manage the coordination. -- Do you trust your programmers?
Admin
Minor nit to pick: I'm using composite keys. The primary key of Order is OrderID AND CustomerID. The primary key of OrderLineItem is OrderID, CustomerID, and LineItemID.
So your argument against this architecture is that I need to have a multistep transaction in order to change the association of order and customer? First of all, I hope to God that I have someone on my staff who understands what happens when one turns off autocommit. Frankly, this is a trivial problem that anyone who calls themselves a database programmer should be able to handle. But more to the point, I would rarely if ever code such a transaction into the system in the first place. The proper way to do such a transaction is to cancel the order and reenter it under the appropriate customer. (Note that I didn't say re-key - I'll provide an automated way to do it.) It is always better to have an audit trail on these things, and changing history is not the way to have a trail.
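A sketch of that cancel-and-reenter move as a single transaction (the Status column and values are made up; it reuses the composite-key tables sketched earlier):

BEGIN;
UPDATE CustomerOrder SET Status = 'CANCELLED'
  WHERE CustomerID = 17 AND OrderID = 1001; -- mis-entered order stays on record
INSERT INTO CustomerOrder (CustomerID, OrderID, Status)
  VALUES (42, 1001, 'OPEN'); -- reentered under the right customer
INSERT INTO OrderLineItem (CustomerID, OrderID, LineItemNumber, Product, Quantity)
  SELECT 42, 1001, LineItemNumber, Product, Quantity
  FROM OrderLineItem
  WHERE CustomerID = 17 AND OrderID = 1001;
COMMIT; -- autocommit is off, so the whole move is all-or-nothing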
Admin
I only want to know one thing. I got this strange bill in the mail so who should I send the payment to?
Admin
revMike - Sincere thanks for the clarity. When you point out the obvious risk of multi-step transactions, it becomes apparent that simpler doesn't always translate to safer. Also obvious is my lack of experience with advanced db systems that support composite keys, auto-commit, and atomic transactions. I hope you don't mind my pushing your button on this thread. I'm sure it was to my benefit. Plus, the point of an audit trail is plain-ole common sense. At some point, you'll have to prove the change... what better than a record of the initial transaction?
Admin
I love to argue. :) Besides, what better way for everyone to learn if not by constantly challenging and being challenged.
I cut my teeth working on a financial system where the rule was that a data element was never updated. The only time we ever ran an update statement was to add data to a field that was intentionally left null by an earlier insert. It was very nice to be able to track the movement of every dollar through that system. When a user would call and say "You broke the system! Where did my charges go?" we could say "On date XX/YY/ZZ user UUUUU transferred those charges to account NNNNNN-NNNN-NN." Ultimately, a good audit trail protects the developers from bug accusations.
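A sketch of that insert-only style (ledger layout made up for illustration):

INSERT INTO ChargeLedger (AccountNo, Amount, Action, ActedBy, ActedOn)
  VALUES ('111111-1111-11', -500.00, 'TRANSFER OUT', 'UUUUU', DATE '2005-12-01');
INSERT INTO ChargeLedger (AccountNo, Amount, Action, ActedBy, ActedOn)
  VALUES ('222222-2222-22', 500.00, 'TRANSFER IN', 'UUUUU', DATE '2005-12-01');
-- No UPDATE ever rewrites history; "where did my charges go?" is
-- answered by reading the rows.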
Admin
I beg to differ. Especially in a multilevel master/detail relationship, it's better to keep the references (primary/foreign keys) simple. I'll try to work through it with your enhanced example.
So far so good.
OrderID itself is not unique anymore. If that's a requirement, your approach neglects it. Every reference to an order now has to include the customer ID. Sure, you could add a unique constraint to OrderID.
Same goes for LineItem, although people sometimes try to provide a gap-free line item number starting at 1. Needless to say, that approach can cause serialization problems. Any reference to a line item will now take at least 3 (+1) columns. Aside from update/locking problems, it's a waste of space. DBMSs, at least the ones I know of, are "born" for joining; otherwise we could have stayed with hierarchical databases. If an order has been misplaced, you have to run update cascades through all attached tables. The chances of causing locking problems and generating I/O just moved up a bit.
That's why they invented views. If you want to "denormalize" the data and keep everything in one "table", write a view.
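A minimal sketch of such a view, reusing the composite-key tables sketched earlier:

CREATE VIEW OrderDetail AS
SELECT c.CustomerID, c.CustomerName, o.OrderID,
       li.LineItemNumber, li.Product, li.Quantity
FROM Customer c
JOIN CustomerOrder o ON o.CustomerID = c.CustomerID
JOIN OrderLineItem li
  ON li.CustomerID = o.CustomerID AND li.OrderID = o.OrderID;
-- one logical "wide table" for reporting, with no physical denormalization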
That's what databases are for, so I heard. Problem is: your model is now optimized for a certain type of question. But if I want to create a report including order date and customer name, I still have to do the joins with your approach, and I'm still carrying the extra weight of your composite keys - no savings there. Maybe your customers want their data model optimized for one certain type of question. Mine keep coming up with new ideas every other day.
Well, if we're talking about tuning, there are other options, such as materialized views, which, as anyone would have guessed, stem from data warehouse environments; same goes for analytic functions, etc.
A three-level nesting may be simple, but when you've seen data models with 8+ fields of composite keys, you wish you'd never seen them in the first place.
Natural keys: I only use them in lookup tables, because the chances there are really low that you have a) additional attached tables and b) updates on the lookup values themselves.
And yes, I work with large databases.
l.
Admin
I don't have a problem with designing a database optimized to solve a specific need. In fact, that's the ultimate goal. However, in my experience, a database is designed based on the client interview, which is most often short-sighted... At least that's what I normally get outta the interview. I tend to walk away feeling that the client is desperate, and is looking for a disposable database to solve their immediate need for a specific report. During the interview, when I bring up the possible ( future ) need for alternate functionality, the usual response from the client is "We don't need that now - let's just build what we need." So... my response is to build a database that will be agile enough to react ( remodel ) to the subsequent changes that are inevitably requested. The tuning and optimization comes much later ( after the big picture is revealed ). So... for me, a simple design is the best solution outta the gate.