- Feature Articles
- CodeSOD
-
Error'd
- Most Recent Articles
- Secret Horror
- Not Impossible
- Monkeys
- Killing Time
- Hypersensitive
- Infallabella
- Doubled Daniel
- It Figures
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
I don't think anyone will disagree that today's WTF is a... umm... WTF. However, refuting the whole idea of artificial keys just by today's example is a damn fine demonstration of a straw man argument.(http://en.wikipedia.org/wiki/Straw_man)
Every argument against artificial keys that I have seen go like this:
1. Look at this bad example of artificial keys. Identities suck!
2. But but.. relational databases, but but!
I have yet to see someone explain to me how using identities is NOT relational. Perhaps you can tell me without telling me to go read a book by Date.
Admin
Tables of dates and/or tables of numbers ARE extremely useful and indeed quite handy for many types of SELECTs.
I am stunned, howerver, after reading close to 30 or so comments:
Where are "database design gurus" who, a few months back, told me they would OF COURSE add an "intIdentityID" column to ALL tables, including tables of Months or Days or Years or Numbers, in their database, and would NEVER use any kind of actual key derived from data? They spent many, many posts explaining to me why having an "intMonthID" value of 2 that represents "September", or an ID value of 4723 to represent the number 3, is perfectly normal and acceptable. They must be looking at this table design and thinking "looks good to me!". ...
Admin
It's 5 Freudian. No. I just really really like her.
Admin
They're probably not here because they don't exist. I am an advocate of artificial keys, but that doesn't mean that I am completely opposed to using natural keys. I just think that in a great deal of cases, artificial keys lend themselves to a better designed database.
I can't even imagine how horrid the tables would be in my current project if they all required natural keys. We have over 200 tables. Joins would be horrid especially.
Admin
Alex -- I wouldn't bother trying to convert this crowd. Many of them appear to be great procedural programmers, but in terms of database design we have quite a few members here who believe in the whole " a primary key = an identity and nothing else".
Admin
Here's the link to a previous discussion on this issue:
http://thedailywtf.com/forums/38234/ShowPost.aspx
Admin
<!--StartFragment -->
[image]
Again, I have never said "Give me identity fields, or give me death". I just think they are the rule, while natural keys are the exception.
And I am very much an object oriented programmer, it's all I've done.
Admin
Why do you suck so hard at explaining why artificial primary keys are bad relational database design? I and several other people here have asked you and the other natural primary key proponents here on more than one occassion to convince us of your position. All we get is the same old bullshit, straw men arguments and comments like yours above to Alex, every damn time. Perhaps if you cannot defend your position (acting smug isn't a valid defense) you should find a new one.
Admin
Same here, and I'm not even registered!
[pi]
Admin
Where the heck did you get your degree in database design?
Admin
Hey Uncle -- read the thread I posted a link to. Sorry for the smug comment, I tried pretty sincerely and with a great deal of effort to discuss the topic previously to no avail before.
Admin
Oh, and by the way, what did I say that was so bad that insulted you so much? If you think that an identity and a primary key are the same thing, then what is the problem? I didn't say it wrong to think that way in my "smug" post, did I?
Admin
lets not deviate from the correct way eventhough the compiler will not complain ...
"original post".Equals("straw man")
Admin
No, you are definitely one of the open minded ones here, which is great, you did seem to be intelligent about the whole thing and you seemed to thoughtfully consider both sides, and if I recall you said that you use both strategies in your design ... so, no, that definitely wasn't directed at you. Again, it's the people who say "EVERY table must ALWAYS have an identity as the primary key, there is no other way, EVER." Those are the ones who design tables like we've seen in today's WTF or have tables like this:
YearID, MonthID, DayID, Amount
2003,12,3,$30
2003,1,4,$50
Surely that means $30 on 12/3/2003 and $50 on 1/4/2003, right? Nope! Don't forget to do your lookups! Because YearID 2003 equals 1996, MonthID 12 is January, MonthID 1 is March, DayID 3 is the 28th and DayID 4 is the 12th.
Admin
200 tables? Pretty small if you ask me. :)
Admin
I definitely agree that today's WTF is an example of pure retardation, not relational design. However, this bad apple shouldn't be used as an argument against surrogate keys! That's just a fallacious argument.
Admin
I was going to post that the "'uniquifier' Primary Key" is a real holy war, but I see it's too late!
There are arguments against using them: they are an unnatural artifact; they encourage poor database design by letting you skirt really thinking about your data and what each row represents; they are a wasted column you didn't need; they force you to use joins in places you would not have needed them otherwise.
There are arguments for them: they make joins much easier (I've had much-used tables where the natural primary keys were seven or more fields, making joins slower to write and perform); they let you leave long keys out of other tables thus optimizing I/O; they are more flexible ("oops, turns out Social Security Numbers can change after all").
People feel strongly on each side, for very good reason. Many of us just try to use the best approach on a case-by-case basis.
To me, the big "WTF" is the choice of data types. They are often wastefully large, and the consistency of types ("varchar(50) for everything!") just smacks of amateurishness that I can't believe got in the door of a "large multi-national bank."
Admin
Did I say 200? I meant 200^2. That's right, we've got 2000 tables.
<FONT size=1>Ytram - PhD in Applied Mathematics</FONT>
Admin
It is relational, just not third normal. "The key, the whole key, and nothing but the key, so help me Codd."
If there is a true natural key and an artificial identity then your row fails the "nothing but the key" test and the design is not third normal.
Designers should be very careful, however, when choosing a natural candidate key as a primary key. Some things work in limited domains but fail easily has the system gets extended. For example, I once maintained a personnel system that used social security number as a primary key for person. It failed when we had to 1) add contract workers for whom we had no SSN and 2) we had guest employees who worked for a foreign sister company come to the US temporarily. Also there are a (fairly small) number of people who have been issued duplicate SSNs.
Admin
Can all wanna-be-programmers please stop prefixing stuff??? Prefix tables with 't' etc?!?!?... I don't call your children cPeter and cDiana... Or your wife wSandra???? Even worse: call the ex-wife: xDonna.... Please....
Admin
You are trying to troll me. I sniffed at the bait, but am going to swim away before you set the hook.
Admin
(waves paw)
Bah.
Storing years and months and such in separate tables with ID fields is beyond retarded, and I suspect that problem would be far from the only design flaw in the db schema. If that's the strongest case for mocking ID fields, that's pretty weak.
Artificial keys should be used almost all the time as long as the rest of your table makes sense. If you're storing months and years in lookup tables, then you should look for a non-IT job (not you you, you in general).
At any rate, I hope the lookup tables in today's WTF include
tCharConversion
fId fChar fUpperCaseChar fLowerCaseChar fNormalizedChar
1 a A a a
2 A A a a
3 b B b b
4 B B b b
...
51 z Z z z
52 Z Z z z
Admin
Give me a break, throw out a non-utteraly fallacious argument, and I won't throw out the straw man card. Arguing against ID columns is ok with me (Hint: I never proclaimed to be an ID column advocate). However, to make up your own invalid argument, such as this rediculous example and claim that it has any bearing at all for the pro ID crowd is complete BS. It certainly does not constitute a "valid defense".
Admin
Actually it would have to be xwlpszDonna.
Admin
I suppose I consider my self in the "Identities as primary key, generally" camp. I agree with the points you made about arguments against them for the most part. What do you mean though when you say they force you to use otherwise unnecessary joins? What would you select on otherwise?
To me, the places where standalone IDs aren't necessary are certainly in many-to-many relation tables, and in any kind of lookup situation (3 - March) or (87 - Regular Unleaded) etc, which is a direct correspondence between a value and some descriptor for the value.
The rest of the time though I certainly think the benefits outweigh the costs, especially when the data has a high degree of duplication of values. I deal a lot with "Person" records, so to me joining on First, Last, & Middle Name, Job Title, Employer... etc seems totally unmanageable. Using SSN's is a possibility but with the recent privacy laws and all, good luck if some encryption requirement comes down.
The other thing that makes IDs nice in an OO world is the ability to instance any record as a property collection just by passing a table name and ID value. "Give me ArbitraryTable record 123" is just incredibly powerful when you're dealing with a generic object loader.
Admin
This is even more true when you are dealing with a OR mapper.
Admin
You and your fallic jokes. When will you admit your love of men to yourself?
Am I the only Aqua Teen fan that laughed because it sounded like "Oh you and your third dimension. It's cute. We have five. <pause> Thousand, yes five thousand."
Admin
Did anyone else trip on this? Let's take a trip across business lines ...
You need to learn the fundamentals of something before you can understand why what you are doing is fundamentally wrong.
Admin
I see you are moving beyond the straw man fallacy to the "Appeal to Authority" fallacy. Is it really that tough to make a valid logical argument?
Admin
There are cases where unique artifical identifiers make sense, and there are cases where they don't.
There are cases where natural idenfiers should be used, and there are cases when they should not.
The real WTF in this thread is anybody who thinks that these *MUST* be one or the other.
Admin
Now repeat after me:
"I am sofa king."
"wE tODD eD."
Admin
If the tables look like this, then what does the usage code look like?
(There. That ought'a give folks ample opportunity to drop in a language flamewar, a few WTFs about the different flavo(u)rs of SQL, some comments about profiling before optimizing your tables, and maybe a reference or two to how wonderful the ternary operator is to process the "True", "False", and "File Not Found" entries (the lattermost of which is not currently found ... how appropriate) in the tBoolean table. Not to mention a few comments that "first" isn't. And "brillant" isn't.)
Does the usage code at least (for example) do a table lookup to convert the year 1960 to the YearID 61 (and back again)? (In which case, it'd be fubar but who cares if the table is even sorted?)
Admin
This is why Views were invented, my man. :)
Admin
I'm still waiting to hear an argument against identities from someone who wasn't just trying to make themselves appear to be enlightened in the zen of database design. The 'arguments' always seem to be of the angry-significant-other kind: "Well, if you don't know what's wrong, I'm not going to tell you!" or the I've-been-developing-for-200-years kind: "You try using artificial primary keys in an Ultra Enterprise system and see where that gets you, sonny!"
There are good logical arguments for using either approach, but there seems to be some kind of snob value in avoiding artificial keys, perhaps due to the seeming 'purity' of a 'naturally' keyed system.
There's little point arguing the toss here.
Admin
Interestingly, this is almost a good design. Yes, you can re-label years and such. So that you could take the design to a different customer and if they psychoticly wanted all the years to include their designations in the chinese calendar system, you could.
By using integers instead of words, you can easily sort by them correctly - AND you can group data by months and days by using broken apart dates. Most SQL databases have SQL data types but handle them so badly that broken out months are a neccessity anyways, and once you do that, you pretty much have to convert them to display names yourself.
Unfortunately, that level of thought probably wasn't used here, since his system lets him use the infamous third (or more) boolean value.
Still, flexible and handly.
To those that think that using integer IDs for rows is a bad idea, they beat the hell out of using strings for indicies when it comes to performance. Not that every table needs a unique single ID for each row, sometimes you just HAVE to combine 2 or 3 natural non-character string columns for simplicity.
Admin
I guess one could argue that the tri-state checkbox was invented for exactly this purpose. I do, however, expect that the tri-state checkbox's life will end with the proliferation of internet forms that all use drop-down combos with FOUR options for this: "Yes", "No", "N/A" and "<Pick an option>".
Admin
In case anyone didn't realize, this is essentialy what the "Appeal to Authority" fallacy means.
Admin
My car is bigger.
Admin
Actually, the idea of a separate object identity is very much an object oriented one. It was only really introduced into database land by the OODBMS concept:
The Object-Oriented Database System Manifesto (relevant page)
Personally, I am opposed to natural keys. Well, not really opposed, it's more like I'm disappointed by them. I've tried them, but in practice you always end up with difficulty referring to a table (especially with difficult, multi-column primary keys), or having to update hundreds of referring records if the primary key value of some record happens to change. Also, natural keys lead to redundant storage of the most important data of a record. Most important property of redundant storage: if you update them in the primary storage place, you have to update them in other places as well. Makes it pretty bloody difficult to update those fields of a record then. Also, redundancy is usually abhorred by database design purists when you're talking about denormalized databases, but you can hear these same people defending to choose "name, address" as primary keys of tables to which many tables refer. And in such a case, it is usually wiser to use SSN as an ID anyway, as that is more unique and doesn't change. The SSN is an artificial key if I ever saw one. :)
Admin
As was pointed out above one of the major simple arguments against artificial keys is Codd's (the guy that got us all hooked on relational databases) insistence that there be nothing in that database but real data of real use in describing the data. Thus any 'magic' identity column you tack onto a tuple that doesn't have a real world meaning breaks the model and allows you to put two otherwise identical tuples into your table which then allows you to alter one and not the other and then you have all the data redundancy issues that relational databases were supposed to save us from. Now at this point you can hop up on your high horse and say that you could constrain your columns to not allow data redundancy. That is true but now you've added two artificial properties to your data, you've added an artificial key that is meaningless to the data itself and you've had to add an artificial set of constraints to your table in order to preserve the data integrity that would have already been protected by the system had it been used properly in the first place. So in reality you've actually complicated your DB (+1 column, + what ever is built to support your 'fake' constraints) in order to make your life as a programmer simpler. It also offends the OO and Relational purists in that you are exposing implementation details. According to Codd if you want information about purple elephants then you ask the DB for information about purple elephants and there should never ever be a way to get information about purple elephants without asking for information about purple elephants or some superset that includes purple elephants. If you have artificial keys then you can ask for some nosense like 123 and get information about purple elephants... now what the f* does 123 have to do with purple elephants nothing except in your particular implementation of the DB. Now your going to have code that is asking the DB for 123 and it won't make sense where as code that asks the DB for purple elephants will.
I see I've begun to ramble, anyway to make a long story short there is probably not a really nice pat answer to your question. Any explaination as to why fake keys are bad is going to go into deep theory and what not because finding the best way to represent data is a 'hard' question that hasn't really been solved even now, though we are doing much better than we used to when we were stuck with Flatfile and CODASYL systems. Also every real world DB system has all sorts of crutches to allow real programmers to abuse the relational model to get data into it how they want and use it the way they want so you can 'work around' most of the problems of breaking the relational model and many people just expect this. Things like non-key constraints, 'like' operators, stored procedures and what not all break the model but people need and use them every day.
Admin
Oh, but that's easy. It didn't have to enter through the door. Why? Large multi-national banks usually still run large amounts of COBOL code. Using files instead of databases. Using files with fixed-length records. It only takes a team of re-trained COBOL programmers to come up with this, nothing more. And they already have those. Trained in-house, never came in through any door!
Admin
Well I consider myself very camp but I don't go boasting about it.
Sincerely
Richard Nixon
Admin
That last one ties into something that Handyman mentioned:
I've seen this in practice. Most of my work is with a package that mostly uses natural keys; you need special utility programs to renumber a natural key in all files at once, plus if you have any custom tables then you need to inform the utility of them as well. A little of my work is with a higher-end package that mostly uses artificial keys; the human-readable key is stored in exactly one table, so changing one is near-trivial.
Also, not only is "all keys are ints" efficient for lookups and joins, but it's much easier to remember when designing new tables.
Admin
Good stuff. That's one of the same points I made in the thread mentioned earlier, one which is commonly ignored.
To simplify: You need a table of U.S. states. You create "StateID" as PK. One day, we get two entries for "North Dakota". So we add a constraint saying only 1 entry per state name. Then someone enters "N. Dakota". More problems. Now what? Looking at the data, we realize we can add a constraint saying only 1 entry per 2-char abbreviation. Now we are feeling good. And, of course, this all could have been avoided if we just declared the PK as the standard 2-char state abbreviation in the first place, and now we have extra columns and extra constraints in our table, and a meaningless "StateID" value all over our place in which a join is required just to print the 2-char abbreviation.
The most common thing I hear is "Joins are harder/more complicated to write when you have multi-column PK's." Ironically, often you need LESS joins in your design if you store natural keys, especially mutli-column keys on junction tables and the like.
Another one I see often is "OK, then, what is the natural PK of an address table? It must be all columns!" no ... it's whatever the PK of your entity is, plus an addressType identifier if your entities can have more than 1 address. (and/or maybe a "status" of some sort) Storing more than 1 address for an entity without explicitly classifying the difference between the two is completely, uterly pointless, and something that identies often not only allow but encourage. But if I force you to have exactly 1 "billing" address, and 1 "home" address, and so on, by forcing you to create or assign types for each one you add, then your data has meaning and your address table has a perfectly natural, intelligent primary key.
Admin
The funny thing is, in an artificial-key database where an address can have a primary key of its own, the foreign key would probably not be part of the address, but of the person / company / dog / institution that the address belonged to. The nice thing about that is that you can use a single table (address) to represent the same thing (an address) for multiple types of entities (person/company/dog/institution). And any entity type can have zero, one or several separately specified address types: one foreign key for each one. A very nice feature as well: some of the columns may be NOT NULL, so that you may specify that a dog MUST have a home address. :)
Admin
Using "id" everywhere is indeed a WTF. <table>Key (CustomerKey, EmployeeKey, etc.) is a good pattern.
Failing to identify the natural key and enforce its uniqueness (whether or not you use an artificial key) is also a WTF.
Now some stuff from the linked thread, since it seems a good time for a refresher:
This is generally a good case for natural keys, yes. State abbreviations change rarely enough for natural keys to be a better choice.
Artificial keys are fine here. What if you later want to change that unique code? (Another thing about state abbreviations - they're based on a global standard, whereas customer/employee/etc. keys follow whatever standard the current package/company impose.)
Artificial keys are fine here, too. Moving accounts around is rare (unless someone screwed up the original chart), but What if you later want to add some digits to all your account numbers (e.g. you go international, and want an extra segment corresponding to countries)?
Any table whose natural key is entirely made up of foreign keys (regardless of whether those foreign keys are themselves natural or artificial), the natural key is probably fine.
Blindly doing anything is a recipe for WTFs up the wazoo. You do need some "common" sense.
This should only occur in things like reports customized for a particular user, but anyhow:
Status = (select StatusKey from Status where Status = "A") etc.
Admin
Meh. Database normalization taken too freakin' far. This is like what, the 7th, 8th or 9th NF (not like the latter two exist), where absolutely everything must have a unique reference OR is excluded from the DB?
...I don't know why people bother beyond 2 or 3NF. Anything more and databases just become a collection of UIDs instead of a data store.
Admin
200 ^ 2 != 2,000;
200 ^ 2 == 202;
Admin
Well that sounds to me like in your database I would never be able to have an unordered set of anything. What about a 1->many where the many doesn't have a logical "type." I have three cans of tomato soup in my cubbard, do I really care about differentiating them? Not tonight I don't. And I don't want to add a "sequence" column either. I have three cans of soup in no particular order. And I like it that way. Sometimes. Geesh this is stupidest thread EVAR.
Admin
It doesn't appear to be a straw man, since it seems that Alex is a passionate advocate of natural keys. Either he thought "everyone must prefer natural keys" (unlikely) or he thought "this will really stir the turd", in which case he's simply trolling. hmm...maybe Alex's initial post should have a "troll" button...
Anyway, all thoeretical points aside, I advocate the use of meaningless, artificial keys unless there is a very good reason not to use them. I find that (in the "real world") you get into far less trouble with PKs that have no other semantics and simply provide uniqueness: tables are smaller (which usually indicates better normalization), keys are smaller and easier to manage (for code & people), you don't run into "changing the PK value" problem*, etc.
As for the date part tables: Data Warehouses and other OLAP designs are intentionally denormalized. As part of these designs, there is typically a table dedicated as a temporal data dimension. It has such columns as Year, Month, Day, DayOfWeek, IsWeekday, IsHoliday, etc. This makes crunching vast amounts of data which is dependant on dates & times much, much easier. It also has a single integer identity PK, which only provides uniqueness. Maybe this is an attempt at something like that (although they missed the mark, they hit the WTF bullseye).
*The canonical example of this is a bank using AccountNumber as the PK and printing it on a person's statements. Works fine until you merge with another bank. Then if the AccountNumbers have differing formats, (or worse, overlap) you're in for a big rewrite.