- Feature Articles
- CodeSOD
-
Error'd
- Most Recent Articles
- Secret Horror
- Not Impossible
- Monkeys
- Killing Time
- Hypersensitive
- Infallabella
- Doubled Daniel
- It Figures
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
But if you have a new record, you might possibly need the relationships that are currently associated with "HORE". I could see your response applying to an entity with transactions associated with it(such as customer -> orders, or employee -> sales), but what if the relationships are necessary for the proper functionality of the entity?(ex. user -> permission settings) If you need to reproduce the relationships for the new entity, that can be just as troublesome as a cascading update.
However, if you use a surrogate key, you can quickly and easily update the value that was incorrectly entered by the user without worrying about reproduction of relationships or cascading updates.
Admin
That makes a lot of sense. I dunno why I started thinking of addresses as entities that needed to be related to people rather than merely attributes of people, the latter of which will, in most situations, be the case. Regardless, thanks for clearing that up for me.
Admin
Yes. But be carefully about being seduced by near-keys like SSN. If you have any doubt, it is safer to use an artificial key.
I'm not exactly sure what you mean here. If a column is only dependent on part of the key, then that column should be normalized into a different table. When you say "rows that share part of a key also share some data" do you mean they join to the same master in a master detail relationship? This is a natural consequence of composite keys. I don't see any reason that this situation should trigger the use of an artificial key. Don't forget that there is nothing invalid about creating a composite key with both natural and artificial elements. If you have a one-to-many relationship and there is no distinguishing characteristic to the many side, you can use the foreign key to the parent along with an artifical distinguisher to uniquely identify each row.
This is a very common design pattern when you have a master-detail-[subdetail[-subsubdetail]] structure. Each table is keyed with a foreign key to its parent and a distinguisher of some sort.
That is a good goal. I'm sorry I can't help. I learned through an apprenticeship with some good designers many years ago.
Admin
On the other hand, in my experience the amount of data that needs to be modified for these "relationship cases" like user->permissions tends to be much smaller (and simpler in structure) than transaction relationships. Therefore it should be fairly trivial to modify these by hand.
Just so you realize, I'm arguning at this point just for sport. I don't strongly disagree with your point.
Admin
I will see if I can track down a good book that isn't all theory; that's the problem -- it seems most are either all theory or not enough.
>>If I understand this correctly then wouldn't at least part of each of your composite keys always exist in a parent table as a primary key?
As a "rule of thumb", that is a pretty good way of thinking about it. Otherwise, the composite key is just a set of attributes, and indeed it may be clearer,easer, and/or more efficient to put a surrogate key on the table. But some of those attributes are actually entities, those entities and all info dependent on those entities should be stored in a related table (i.e., you should be normalized) and the reference to that entity should be part of your composite primary key.
Admin
Well good- that means I understand the implications!
On a totally not related note- I've seen posts say that CAPTCHA doesn't get the first post. It appears to me that it has nothing to do with "first post" but rather a timeout. Forgive me if this has been noted- I'm rather new around here.
Admin
True enough, although row-level security(user -> permissions) can have a fairly large amount of rows per user.
Another point: Would your customers want to have to worry about looking at history over two(or more!) different customer ids? Let's go with the example of customer -> orders, where customer id is a natural key entered by a user. Initially the stupid user entered a potentially offensive name of "HORE" for the Arabian Horse Owner's Association. With this id, the customer made 10,000 transactions from 1/1/2005 to 2/31/2005, then the correction was made and a new record of "HORSE" was created. From 3/1/2005 to 4/30/2005, another 10,000 transactions were made.
Now the customer wants to go look at their history. They enter their customer id of "HORSE" and enter a date range of 1/1/2005 to 4/30/2005. They only see 10,000 transactions when they are expecting 20,000. Then they try 1/1/2005 through 2/31/2005. No transactions at all. Then they slap their heads and realize: There is only 28 days in February! They also remember that their customer id was changed, so they go and look up "HORE" for the first date range and get the expected number.
I know any customers I've ever dealt with wouldn't care for this at all. The situation could have been completely avoided by using surrogate keys.
<FONT size=1>or maybe cascading updates.</FONT>
Admin
Two things:
1) you mentioned "maybe cascading updates". If you decide on ANY natural keys that might EVER change, and cascading updates are not available, then you shouldn't use them. That's pretty cut and dry. Again, do not confuse keys that define an entity with keys that relate entities. The former very often might be surrogates. The later almost never should be.
2) Another thing you missed is that if you decide to leave HORE as is, but then create a new HORSE customer, you would book negative transactions to zero out the original HORE customer account and then book the positives to move the transactions to the new customer. Now you have the best of both worlds: a correct customer code, correct transactions booked in the right place, and a perfect trail of exactly what occurred.
Admin
I think I mentioned earlier that much of my experience is with accounting systems that did not ever do a delete. We dealt with things like this on a regular basis by providing the users with transfer tools. In this way, there would be in HORE 10,000 new transactions and 10,000 offsetting "Transfer To" transactions moving the activity out of HORE into HORSE. In HORSE there would be 10,000 new transactions and 10,000 "Transfer From" transactions recieving the activity from HORE. Now the data is right, but there is also a complete audit trail and the system can reproduce the state at any point in time.
Admin
LAST!
Admin
One more thing to consider:
So you create a database-only surrogate key, CustID, is strictly internal for relations in your database. The customer code, "XYZ", is our business customer code which we stamp invoices with, customers use when they call, etc., but we avoid this as our natural key and use a surrogate CustID identity. This allows us to change the customer code as often as we want, right?
OK, let's take customer ID #45, and change the customer code from XYZ to ABC to correct a mistake or to update our records. Done! No messy cascading updates, no fuss, nothing else changed! Life is good. Take a vacation.
Two weeks later, someone calls customer support for help with your product.
"No problem", says the support rep on the phone," what is your customer number?"
Unfortunately, the person calling doesn't know that the customer # has changed, and tells you XYZ. Customer not found ! hmmm ... must have forgotten the customer code. Thinking quick, looks up the invoice for the product which had the customer code printed on it. He finds it, and there is again: XYZ!
"Still not found," replies the support person. Then he has an idea: "Well, since you have the invoice, what is the invoice # that you found? That will help me get the customer code." The customer replies invoice #1235. He looks it up, sees that it is linked to customer ID#45. Customer ID#45 has a customer code of .... "ABC" !
"Sir, something's not right here, the system says invoice #1235 was for customer ABC! "
"But right here, on the invoice i am looking at, it says invoice #1235, customer code XYZ! I am holding it in my hand!"
"Uh .. .please hold!"
Admin
I hate it when someone posts my point while I am composing my own post. :)
Jeff, it looks like you've been playing with the big boys for a bit. Drop me an email at my nickname at gmail. I might be in a position to hire someone in New England in the next six months.
Admin
One benefit of natural keys that I don't think I've seen anyone mention is select performance. You can only have one clustered index on a table. If the business key can be used as the primary key, you get the benefit of fast look-ups for both user requests and joins. This is can be particularly helpful if you are querying large detail (descendant) tables and you have carried the ancestors' key fields down to the child table as part of the primary key.
Bear in mind that I am not suggesting that this is always the optimal way to go. Preserving the ancestor key fields could use so much space that you are getting many fewer rows per page, negating the benefit of using the clustered index.
Admin
I definitely will, I am always up for new challenges .. you can also contact me as well through my weblog at sqlteam.com.
Admin
Dear customer,
as you might already have heard or read, our company has recently merged with Jeff's Empire.
As part of the consolidation process, our customer codes will be prefixed with "2-", while Jeff's Empire's customer codes are prefixed with "1-". So, from now on, your customer code is 2-XYZ. Please use that customer code for all future communications. Of course we will accept requests, orders etc. with your old customer code till Dec 31, 2006.
Best regards,
ammoQ's Lair
Admin
hmm ... you didn't read the post very carefully, nor did you get the point ...... If I were to respond to this with a response of similiar effectiveness/relevance to yours, I might say "the mail doesn't get delivered on sundays."
Admin
No sane company arbitrarily changes customer codes from XYZ to ABC. And even if, XYZ will not be assigned to another customer. And if everything else fails, every customer has a name and an address, so the call center guy would not say "it's for customer ABC", but rather "you are Mr. Smith, right?" Your example is totally unrealistic. On the other hand, because of mergers, reorganisations etc., it might be necessary to fix the situation that customer numbers are no longer unique. Prefixing like explained in my example seems like a realistic option to me.
Of course you need procedures to deal with customers who use their old codes, at least for a while. The guy at the call center knows about the merger, knows how to handle it, so if the customer says "XYZ", he will
a) know that this must be an old cold, since new codes are 5 digits
b) look up and see customer code "XYZ" is eighter Mr. Smith or Ms. Chin Quai, so
c) when in doubt, will ask.
Probably he will see which number was called, so he knows that the customer called the telefone number of company X, so he must be Mr. Smith.
Anyway, at some point in time, the merged companies will want to integrate their systems, their databases etc. and in that case, the artificial ID will come handy.
I don't say you must use IDs everywhere if you don't feel like it. It's just that I cannot stand people who dismiss that without valid reasons. I know a system which used natural keys; it has been replaced with a system that mostly uses artificial IDs. Both systems did resp. do their job.
The system with the IDs avoided some of the problems that riddled the previous system, of course at some costs. In that case, it's worth the cost. I don't know your systems, your requirements and I would not dare to tell you what is better in your case.
<°)))><
Admin
I routinely deal with accounting systems designed not to delete things (once the data is moved from data entry to permanent history, and not counting purging of old history at year end) but still allowing retroactive renumbering (not mandating it; you can still do things your way if you like). As usual, complete audit trails are at odds with convenience, and the right balance for an individual case depends on what level of audit trail you really need in practice.
Admin
I think most people may have missed a small detail on why 1899 is ground zero.
Did you know that an OLE Date (you know, the one used in VB's date format) is a double value, where 0.0 = Dec 30th 1899. That is, CDbl(DateSerial(1899,12,30)) equals zero.
Admin
Do you feel I've been "dismissing surrogate keys" in this thread without valid reasons? I have just been posting over and over "surrogates are bad" without making any valid points? Before you decide that you can't stand me, do me a favor and see if you can sum up my feelings on this topic in a sentence or two.
Admin
That last sentence didn't mean you personally, I'm sorry. It was also meant to read "... cannot stand it when people..."
Admin
My initial thought about this is that this isn't really a matter of what you decide to use as a primary key, but more a matter of knowing how to handle this kind of situation. Let's say that you like natural primary keys better than artificial primary keys, so when you design this table you decide to use the customer code as your primary key. Fortunately, your RDBMS has excellent cascading update capabilities and such, so changing the customer code for your customer from "XYZ" to "ABC" and changing all foreign key references is a snap. But once you do so your screwed just like the person in your story who used artificial primary keys, because you've erroneously decided that no history of a change like this should be kept, and that's where the real problem lies.
Does that make sense? It's time to go home and I'm getting low on caffeine, so I doubt my ability to communicate effectively.
Admin
Fiiiiiiiishy. =D
Admin
No, you are making sense. The bottom line is that mergers and acquisitions are a pain in the ass no matter how you model customers. It's just degress of pain in the ass.
Admin
not that I'm agreeing or disagreeing with you (or the person you are responding to here), but the correct response here may just be: "what downtime?"
In theory our system is expected to be up and running 24*7 - there is no "regularly scheduled downtime"... we only take it down as an absolute last resort, so any way of avoiding having to do major database surgery is a Good Thing... even if it means an extra key.
Of course I'm speaking in theory here as in practice our database structure sucks worse than the given WTF... but I was hoping just to inject an idea that maybe sometimes it's a good idea to avoid this sort of thing *ever* having to be done.
Admin
erg - sorry.. replied while still reading through the thread... so have duplicated another post. Apologies - i'll try to remember not to do that again. :(
Admin
<FONT face=Arial size=2>Hi guys</FONT>
<FONT face=Arial size=2>Just thought I'd add a couple more details of this system. The database is a SQL 2000 database (so "ground zero" is 1 Jan 1900 rather than the end of 1899), and dates are stored in datetime fields (small mercies, and all that) so referential integrity isn't an excuse. One table Alex removed from my original message to him, that I was particularly fond of, was tPercentage. I'm sure I don't need to give you the full details of what that contained. Let's just say that it had 100 rows that start like this:</FONT>
int identity</FONT>
varchar(50)</FONT>
<FONT face=Arial size=2>I hadn't meant this to turn into a physical versus artificial key debate, though it's been enjoyable. My reaction to this was more based on the unnecessary tables and odd choice of data types. The system, incidentally, was developed within the last five years, and I don't think an ex-Cobol developer (despite the evidence to the contrary) had been anywhere near it.</FONT>
<FONT face=Arial size=2>Dave</FONT>
Admin
Answer: Start by designing it right. Fully normalize, with only naterual keys. Get this design right first. Save it in version control for everyone to see (even though it will soon be obsolete and only refered to again in a debate). Then step back and sleep on it a little.
Now use experience to intelligently break accidemic theories. It looks like you have a good excuse to create an artifical key. You (or your expert friends) may also have some experience in the domain that says some intelligent de-normalization may be useful to make the database faster (in someplace where speed is needed), you can do this too, but make sure you carefully document your reasoning. Also make sure that this isn't premature optimization which is more likely when doing something like this.
Admin
Exaggeration aside, it is because the first number has meaning outside of the database, and the second does not. For example, a bank account number has meaning Out There. A problem with the internal number is that it can escape. What happens then?
Sincerely,
Gene Wirchenko
Admin
<FONT face="Courier New" size=2>dear gene,</FONT>
<FONT face="Courier New" size=2>can i put your signature on a t-shirt?</FONT>
<FONT face="Courier New" size=2>casually bloated,</FONT>
<FONT face="Courier New" size=2>emptyset</FONT>
Admin
You hunt it down. With dogs.
Admin
I like how Gene is so full of himself that he won't reply to this or any other question of him. What a pompous jerk!
Sincerely,
Richard Nixon
Admin
And then you use the two char abbreviation to refer to that everywhere, including outside the database, and suddenly one day North Dakota decides to change name to just Dakota and decides to change it's two letter abbreviation as well, and you're up the creek without a paddle, having to figure out all the places you've used the two letter abbreviation. Or the US government decides to switch to three letter codes for whatever reason, and suddenly your two letter codes ARE artificial keys (though I would argue they already are - they are an identifier created to refer to the natural key - the state name - in a shorter more succint way).
Country codes? Same problem - many country codes HAVE changed through the years to reflect country name changes, or have changed meanings as the countries they used to represent have been broken up.
The point is that id fields are as much part of the data as anything else. VERY FEW things genuinely have a naturally occurring identity suitable for use as a natural key. People keep bringing up SSN's, but forget that 1) SSN's aren't always unique, 2) SSN's can change for a person, 3) non-US citizens doesn't have a SSN and might not even have a unique ID number.
The same is true for a lot of other things. You brought up addresses, and tried to justify it by introducing an "addressType". However what about people with more than one home address? More than one office address? You would need to be able to refer to a specific address for the purpose of referring to it from another table, and you'd need to add something like "home address #2", at which point you might as well add a unique address id and stop shooting yourself in the foot.
But let's do something difficult. How do you uniquely identify a person? A customer, for instace? You can't use name, obviously - not only will you have duplicates, but they change too to make things more painful. You can't even use name and address - plenty of families have more than one person of the same name living at the same address. You can't use SSN as mentioned above. You can't use passport numbers - lots of people doesn't have them and they're not guaranteed to be unique.
THERE IS NO NATURAL KEY UNIQUELY IDENTIFYING AN INDIVIDUAL. Heck, even the full DNA structure won't be sufficient thanks to the joys of cloning. You have to invent a customer id. Now, you can try to pretend that this is part of the data so it isn't really an artificial key, but that is just bullshit. Artifical keys exists just for this purpose: When your assumptions about what constraints are valid on the candidate keys can't be guaranteed to always hold.
As a bonus they are useful because RDBMS's do not implement the relational model in a way that makes it efficient to use arbitrary long keys and because querying based on them is inefficient with SQL. If SQL let you do "SELECT foo,bar FROM table1,table2 WHERE table1.key(somekey) = table2.key(anotherkey)" or something similar, and automatically used artificial keys and constraints behind your back to optimize things and to handle changes in the key values smoothly, then I might agree for some cases (but that still leaves the cases where the identity of a relation needs to be referenced outside the control of the DBMS).
In practice, I can think of extremely few things that can be uniquely identified by natural data that are guaranteed unchanging. Locations? Names, zip code/postal codes, even country names and codes changes all the time. You could use latitude/longitude, but even that may be problematic (a region might be subdivided or a country might see regions gain indepenence, for instance, and the lat/long you've used may turn out to be in a part that's breaking away).
Here in the UK county borders have changed many times through history, including recent decades - the area I lived in changed counties AND had a land area more than the size of the original borough, yet the current Croydon borough (now part of London) is a direct continuation of the then much smaller borough of Croydon, Surrey, so this isn't just an academic example.
What's the natural key for an order? Surely not the items ordered, or the customers details, or the price or time or any combination of that, as they're not guaranteed to be unique? The closest to a "natural key" would be an invoice id - again an artificial piece of information (an aritifical key originally created for paper objects if you like), but even that isn't safe - it is not uncommon for a single order to result in multiple invoices or for a single invoice to refer to multiple ordes. And while that may not be the case in your application today, expect it to change the next time sales, marketing or finance gets a bright idea.
I challenge you to list 5 things likely to occur in a real application that have "natural keys" that aren't simply artifical keys created to manage the things they identify in the real world, and that are actually likely to be unique and stable (i.e. won't need to change)
Vidar Hokstad
Admin
Not a very good try. Account numbers change. "Often". While the account itself stays the same. The account number will often be referenced from large numbers of other systems, including systems that will NOT store their data in your database.
And account numbers may/will be recycled - the frequency would depend on bank policy and legislation, but it's not like the numbers are limitless.
A typical example of this is a bank with multiple sort codes that denotes a specific branch - though the branch system is slowly breaking down in most countries, some banks will change the account number if someone wishes to change the branch. But the account is the same - only the id used to refer to it EXTERNALLY has changed.
And an account may often be referred to by multiple numbers. Case in point: Credit card accounts, which usually have multiple card numbers associated with them.
In both these cases not tieing the account number to the identity of the account serves an important purpose: It decouples the two so that when one changes, it doesn't require an immediate update of half a dozen other systems, some of which may even be beyond the banks control (such as standing orders with companies etc.).
I recently changed my card numbers, but my old card numbers can still be used as payment references to settle my card balances - they're just blocked so they can't be used to charge the account any more.
So what would the natural keys for a bank account or credit card account be?
The nearest thing I could think of would be the original account number + the creation date, assuming no account numbers will be recycled within a single day (if so you obviously only increase the granularity). But this is information that really is no better than an artificial key for a bank account relation - the original account number alone only has value for a limited period before it becomes just a meaningless value used as part of a key.
Vidar Hokstad
Admin
<FONT face="Courier New" size=2>if there are no objections, i'm going to look into cafepress'ing this puppy.</FONT>
Admin
I've been reading the DWTF for probably close to a year now, but haven't bothered to sign up or comment on anything, until today.
I'm a hobby programmer, with no formal CS education and nobody to learn from face-to-face, since I work in an unrelated field. And as a hobbyist the types of problems I'm exposed to are relatively limited (no large databases, for example). This site is a great way to hear experts', and self-proclaimed experts', comments about a broader range of topics.
To me as a reader, this has been the best thread in a very long time. Long and thorough discussions of fundamental questions, with lots of examples, and far more interesting to read than a textbook. Thanks guys. [Y]
Admin
Vidar ! This is great stuff. Thank you for your posts. They are missing one thing, though: can you let me know what your conclusion is after all this, or what your thesis is? You made a large amount of points using several examples, but can you tie it all together for me so I can really get a full feel for your perspective on the issues. (I..e, what kinds of keys are good on what kinds of tables, rules of thumb to use when modelling, physical versus logical, etc)
Thanks for taking the the time to respond.
Admin
Or more likely (per a recent experience with my card company):
- Blocked my cards because I lost my wallet.
- Few days later I had to do a transfer, but hadn't gotten my new cards. Called customer service:
"Hi, I need to do a transfer, but I don't have my new card number - I reported my cards lost a few days ago. Can you help?"
"Sure, what's your old number?"
"...."
"Ok, looking up your account now. There, just provide me with some answers to some security questions, and then I'll give you the new number that's been linked to this account and we can arrange the transfer for you.
Me: gives them the security details.
Generally account numbers/card numbers is a many to one relationship to your account, where the key linking them have no semantics. It may be your original account number or account number plus creation date, or whatever, but there's no reason to do so - the data linking the two has only one function: To reference a unique account from an account/card number entity.And for your invoice example, I would NEVER, EVER design an invoice table that doesn't either duplicate ALL information that should be displayed on the invoice OR reference tables where I know the information will NEVER be updated or deleted (insertions only) - I do realise you're not advocating referring to data that will change here... Just wanted to point this one out. I've seen changing invoice data in too many US systems - here in the EU your auditors would get a stroke and your finance director would start worrying about jail time if they find out the development team has been building systems where issued invoices can change...
My solution for the scenario above would be to add new records, and provide a mapping from the user visible customer id's to an artificial key. I would not update the old ones, and I'd provide the customer care team with a simple way of looking up past user id's for this user.. I'd also as mentioned make bloody sure the invoice table kept unchanging copies of any field used for the invoice - and most likey I would keep an HTML or PDF rendered version of the invoice on original insertion. Yes, finance people often are - and more often should be - paranoid about ensuring their invoices never changes.
Vidar Hokstad
Admin
Ah, but the fact that the first number has meaning outside of the database is one of the main reasons why I would generally (not always) avoid using it as a primary key - things that have meaning outside the database have an ugly tendency to change. Both meaning, format, and for values for specific instances.
If the internal number "escapes", then you have the same problem with that as you do with any other number with meaning outside the database, but no worse. I'd rather take the risk that a value may gather some limited meaning outside the database than choose a primary key that definitively does have a value outside the database and where the meaning is under the control of someone else.
There are cases where values are stable enough for me to use natural keys as the primary key, but not often. Suddenly what was previously a unique key turns out to not be unique, or it needs to change but a significant number of external (to the database) systems contains references to the old value that you have no control over and you find yourself needing ot maintain a history of the externally visible id's mapped to a single entity.
Vidar Hokstad
Admin
Vidar,
I echo Jeff S. Good Stuff.
I have a couple of points.
First, I don't think that anyone here has been arguing that SSN is a good key. In fact I have used it several times as an example of a something that seems seductive but makes a poor key.
Second, we don't build data models in a vaccuum. We build them to implement business processes of some sort. So while there may not be a general unique natural key for employee, invoice, or order, a business process will frequently implement one. Every business of reasonable size will implement some sort of employee number, invoice number, or order number. Our data models are supposed to be reflective of the business processes we are building. It is impossible to build a generic data model, so our choice is really what parts of the business rules get built into the system and what parts get abstracted out. If some attribute isn't likely to change and has a reasonable set of business rules around it, we can use it as a natural key. Otherwise we fall back to an artificial key. This is where a database designer needs a good understanding of the business domain so that he can make appropriate judgements.
Admin
Jeff,
I must admit I first thought you were dead set against using artificial keys, but as I read through the rest of your messages I think we agree about quite a few things.
I'm certainly not against using natural keys where they are guaranteed to be stable, or where changes doesn't matter (i.e. where cascading updates isn't a problem). For me, using artificial keys serve two purposes:
- I would use them if the externally visible key is not guaranteed stable and I need stability, or where they are not guaranteed unique. I find that this is the case much more often than I'd like. I.e. the account number issue.
- I would use them as a possible optimization/simplification where the primary key would otherwise be a large composite key. I see this as overcoming database shortcomings - I mentioned at some point that I wished SQL had a simple way of requesting equality of a full composite key against a full composite key of another table without explicitly listing the fields, and that the DBMS then could do the whole artificial key optimization for us behind the scenes exactly when it makes sense from a performance point of view. That would save a whole lot of uglines and a whole lot of profiling that otherwise might be needed for pragmatic performance reasons.
The first reason above is really the most important for me. The more I work with databases the more I see uniqueness constraints on externally visible keys breaking down, and/or needs to maintain history and/or changes that needs to be contained.
My general solution to that is to either introduce a purely internal artificial key AND if the system needs to maintain history I will likely either pull the real key out into a separate mapping table OR create a mapping table for historical data.
I see either as workarounds to contain "real world breakdown" on what is otherwise sound theoretical models by containing all the ugliness that stems from dealing with actual customers (internal to the organisation or external).
How far I go depends on how glacial the changes are (I mentioned US states changing names as a possibility, but more as an example of how you always need to take your specific needs into account before assuming data doesn't change... For most apps. an occasional change like that doesn't warrant the introduction of artificial keys) and how static my data needs to be.
For an invoicing system for instance my basic assumption is: The invoice NEVER EVER changes, and any data it refers to needs to be static, but customer data rarely is, so I either introduce an indirection, most likely using artificial keys that should never make it out of the system anywhere, or I duplicate data. I may end up keeping the customers address in the invoice table, for instance, or copy addresses into a never changing invoice address table, or I may end up referring to an address table (this is one of those rare occasions where an address makes sense outside of a location).
I probably end up with more artificial keys than many people would like, often out of paranoid after seeing what kinds of "static" data suddenly changes ;) But it is not that I have a strong objection to use natural keys, only that I guess I have a very pessimistic view on what types of data actually is unique and changes rarely enough, and also that I often work with systems where the possible natural keys are used all over the place externally and where a change may take a lot of time to propagate to all external systems that reference them (hence the mentioned mapping tables/history tables).
In the long term I think this is much more of a problem with existing database technology than it is an issue with the relational model - database systems COULD do a lot more to reduce the need (or temptation if you like...) to use artificial keys by providing opaque ways of doing many of the things artificial keys are often used for, or at least abstracting away the actual values of those keys - if I use an artificial key to map an external id to a specific entity I don't care about the value of the artificial key.
In fact, I'd love it if the value of it was never, ever visible to anyone. What I want is simply an efficient way of linking one or more external id's to a unique record that may not have any suitable natural keys or where the natural keys that would otherwise be used are simply too large or change too frequently. That SHOULD be an implementation detail of the database. Obviously some places the values would need to be returned, but one way of doing this might be to introduce a datatype that only get bound to a specific user visible value during the lifetime of a single connection to the database. SPARQL, one of the query languges for RDF, allows this for attributes that does not have a provided value (they are essentially used as unnamed nodes in a graph), to able to explicitly prevent you from assigning any long term meaning to the returned attribute.
Vidar Hokstad
Admin
I'm reminded of the choice in Java to place pointers out of reach and instead use an opaque object reference. In this way, there is no temptation to do unsafe manipulation of pointers.
Admin
As much as I dislike Java, that was one of the good decisions they made IMHO. There are the occasional instances where being able to do pointer arithmetic is useful in C/C++, but far more often than not when it is being used it is for things where it's complete pointless... Interestingly enough current practice for C++ is moving more and more in the direction of avoiding bare pointers and instead use smart pointer templates that hide the actual pointer values - which also is pretty close to what I had in mind.
It would remove the issue that Gene Wirchenko pointed out as well about the id's "escaping" - if they're on purpose modified for each database connection, then even if some of the values "escapes" people will quickly learn that those values are useless to them for any other purpose than as a temporary system internal reference to an entity.
I'm almost tempted to do a proof of concept implementation for some simple DB like Sqlite where it would be quite easy to hook it in.... If only I had time :)
Vidar
Admin
<FONT face="Courier New" size=2>vidar,</FONT>
<FONT face="Courier New" size=2>bring me the gentle force of magma. take it to the accumulator and continuously destroy the toppings. please produce a notarized stamp in the likeness of the virgin mary; my heart condition requires it. walk my minature schnauzer in the darkness of the full moon. cook me the soup of a thousand meals and provide a large stack of those addictive crackers you crumble up and throw in. make the signal propagate through the air and land in the basement of a man by the name of charlie: he's got red hair and a big cigar. participate in the octagon and win the prize. watch out for snakes. feed the giant panda before it destroys the fig tree. ordain yourself and marry this lamp to my third cousin.</FONT>
<FONT face="Courier New" size=2>sincerely,</FONT>
<FONT face="Courier New" size=2>emptyset</FONT>
Admin
If you build a system especially for one company, you can depend on their business rules. They say "emplyee numbers are unique and never change" and you use it a your primary key. No problem.
On the other hand, if you create a system that is meant to be sold to several companies, you have to be paranoid about it. Your next customer might code the department into the emplyoee number, so it changes whenever a employee is moved to another department. Or they might have several profit centers, and the employee numbers are only unique within one profit center.
The main business of the company I'm working for is creating warehouse management systems, and after 10 years or so in business, we eventually decided to put artificial IDs into each and every database table. This gives us a better chance that at least some of the code can be reused. As far as I can tell after about 5 years of running this strategy, these are the drawbacks I've noticed:
a) Additional joins are necessary to retrieve what would otherwise be the natural key; but this should not be overestimated, since in many cases the join is necessary anyway (e.g. a list of order items not only shows the product number, but also the product name, so I have to do the join anyway).
b) More difficult are situations where the whole content of a table is regulary imported from an external system and the easiest way to synchronize would be to delete the old data and insert the new version. (Of course referential constraints have to be disabled to do that). Artificial keys make the delete-insert method impossible, since the newly assigned IDs would not match the old values.
c) In some cases it is desirable to be able to delete a database row that is referenced by another database row, and keep the other row. (Of course you need to omit the referential constraint to do that.) If the foreign key in that other row is a natural key, external systems can be used to look up it's meaning should it be ever necessary, which is "good enough" in some cases. For example, it can be desirable to delete orders (in the warehouse management system) a few weeks or months after they have been processed, but keep the stock transaction log (which refers to the orders) longer.
On the plus side, the issues of changing/non-unique natural keys are gone; code resuse between different customers has become much easiser. E.g. a STOCK_ITEM is always located on a LOCATION and contains a PRODUCT, but the structure of the natural keys of LOCATION and PRODUCT varies between companies; so with artifical keys, moving a stock item to a new location is always
update stock_item set location_id = :new_location_id where stock_item_id = :stock_item_id
, while natural keys might force me to writeupdate stock_item set location_no = :new_location_no where warehouse = :old_warehouse and location_no = :old_location_no and company_no=:company_no and product_no = :product_no and product_variant = :product_variant
for one company and something completely different for the next. (This example has been simplified, in reality it's more complex)The fact that all keys have a predictable name and the same datatype gives us additional benefits on the technical side.
Admin
(part 2, since the forum software did not post the final version... I guess everything I changed after switching back from HTML to Design is gone)
Drawback d): ad-hoc queries are less readable, or you write longer queries to do all the joins
To avoid the drawbacks where they really hurt, in some cases we use the "natural key" as foreign key in other tables. (We keep the artificial IDs because of the mentioned technical benefits)
Admin
Until they merge with another company and say "we need to change the employee numbers so that the two companies' employee number systems don't collide". I rather doubt they'll be happy with the answer "That's impossible. You said the numbers would never change and I based my DB design on that!"
I think it was Martin Fowler who said "There are few things that are less logical than so-called business logic".
Admin
The correct answer in that case: "It's possible, but it will cost a lot. The requirements specification said the numbers would never change and I based my DB design on that!"
Admin
Don't forget that for fundamental changes such as this, lots of things need to change and the data model is only one of them. To use your employee number change as an example: Even if the data model is quick and easy to change, it isn't hard to imagine that easily 50% of your transaction screens and reports will need to be modified. Unless someone did a truly amazing design job, the application changes are likely going to dwarf the data model changes regardless.
Anytime you modify a key that is fundamental to your system, it is going to be expensive. Should adding a few hundred thousand dollars on a several million dollar project that may or may not happen at indeterminate time in the future drive your design decisions now? This potential future expense has to be balanced against needing to perform additional joins for virtually every query right now, as well as maintain additional indexes and constraints for a large percentage of your tables.
Admin
This is another one that drives me nuts. As "RevMike" points out, changing the database tables is the least of your worries. Your entire application changes. This is not a good argument for never using a natural key; it is a good argument for thinking ahead and designing your database with flexibility. Using natural keys does not in any way inhibit flexibility or growth if you design things properly. In fact, it forces you to think ahead and forces your application coders and report writers to plan ahead when they write their code if they are using a properly designed datamodel and not just pulling ID's everywhere.
Often, using surrogates gives you that false impression that everything is perfect and flexible and a change is just a quick UPDATE statment to change customer numbers. That mindset is far, far more dangerous and costly than actually taking the time to design your model to be flexible.
There is also a huge contradiction in this scenerio. Why do we have to change employee numbers? Well, because of the merger, we have operlapping employee numbers. But we are changing the employee numbers! By definition, if we can do that, then we can avoid the overlaps ! Either we are changing them , or we are not! If we can change the old numbers to append a "1-" before them, and change the new numbers to append a "2-" before them, why do we have to change both the old numbers and the new numbers? In either scenario, we assign the new guys new numbers !