- Feature Articles
- CodeSOD
-
Error'd
- Most Recent Articles
- Secret Horror
- Not Impossible
- Monkeys
- Killing Time
- Hypersensitive
- Infallabella
- Doubled Daniel
- It Figures
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
Technically true (at least in my understanding) but mostly irrelevant, since the usual problems with transitive dependencies do not apply in this case. This is because you will make sure that the "natural key" is unique, even if you don't use it as your primary key for joins.
Admin
I prefer to model this as a seperate address entry for each of Joe, Sue, Bob, and Bubba. The fundamental problem with modeling them as a single address is that were Bob to move out, your software would have to detect that the address is shared and modify the data for Bob while leaving Joe, Sue, and Bubba at the same address. This complicates your system needlessly.
On the other hand, if Joe, Sue, Bob, and Bubba were linked to an intermediate entity that then had one address, I see know problem with them sharing that address. So if Joe, Sue, Bob, and Bubba were employees who worked at the same bank branch, they could be tied to that branch and then the branch could be tied to the address. Now if the branch was to move down the road, all the branch employees would automatically move with it. However, if Bob were to transfer, he would be tied to a new branch.
Admin
Because GL account numbers, Customer numbers, etc etc etc are subject to change (provably; I have seen both of 'em actually change at multiple clients for multiple good reasons).
You're claiming that no table should have any alternate keys? I don't believe any of the NFs claim that. I'm using Wikipedia's article on "database normalization" as a reference here - if you have a cite to the contrary, please post a link to it.
Even if it does break one of the NFs, it has practical value anyway, because the natural key that you would intutively pick as the primary key is so often subject to change. Also, if you make a consistent practice of using artificial keys of type int (or bigint, if any of your tables might ever reasonably get big enough to need it), then you can write table joins without having to remember or look up what type each table's key is.
Admin
I don't need a link to the contrary. Your own link has the reference I need. Please look at the section on BCNF.
If there are both natural and artificial keys, then attributes can be expressed as dependancies on either key. Therefore, by your own reference article, a table should not have alternate keys.
I've already pointed out in a prior post (probably page 2) the danger of choosing a natural "near-key" as a key. My particular bad experience has been with using SSN as a natural key for person. It works just well enough to become really entrenched in a major system before it bites you in the ass.
This I don't understand. Why do I need to know the type of a key column in order to write a join? I've never seen that in any SQL I've ever written. Of course some idiot could make a foreign key of a different type, but that is a WTF itself. Please provide me with an example of a major SQL dialect (DB2 (all three flavors), Oracle, Sybase, SqlServer, MySql, Postgres, Informix, Terradata) where the type of a column is important when writing a join.
Admin
Sorry to reply twice. I hit post, then remembered the other half of my counter argument...
When you are dealing with sufficiently large tables, you in fact don't want to use an artificial key when a suitable natural key is available. A natural key, or more likely a component of a natural composite key, is likely to provide you with the predicate you need to do reasonable partitioning. Partitioning on a artificial key is pretty useless, since that column is already indexed and is only referenced by to perform joins anyway. On the other hand, a natural key can help your database engine narrow a search to a partition very quickly.
As an example, say you have a transaction table which has a composite key of account, transaction date, and transaction sequence number. If you need to search for a particular det of transactions, and you have a transaction date range, your artificially keyed table will need to do a full scan, or you'll need to add a transaction date index. Since there may be hundreds of thousands of transactions per transaction date, the transaction date index offers some help, but still you'll resort to scanning lots of data because either you need to merge those rows with results from other indexes or scan a large number of blocks.
On the other hand, if you partition on transaction date, your engine will first select the appropriate partition(s), and then search only the indexes in that partition. Your performance goes up dramatically.
Admin
For me, I am generally suspecious of natural keys, because they make assumptions that generally can be wrong, or can change.
The biggest concerns are that:
Lets say, you print an address, and you want 'CA' for the state to show up - and quite nicely, you've used the abbreviation as your natural key for your state table, and you don't even have to do a JOIN to print the state in the address.
Then, the client asks for something retarded, like 'Ca' instead of 'CA' to be printed up because he thinks all caps looks unprofessional. Never underestimate the sorts of things clients will do or ask for.
Personally, I consider artificial keys just a way to model a relevant fact in the data: That you are dealing with two identical records.
Take
int a = 0;
int b = 0;
They are not the same - only their data is the same, if you wanted to model that in a database you'd need something like:
table ints:
memoryspace value
1 0
2 0
To me, it seems like, natural keys are an unnatural way to get around the fact you are dealing with seperate entities - trying to make the data values reflect its unique purpose. Thats nice, but I like to think in terms that a customer row entry is an Instance of a customer record, and therefore to the database, the fact it is a unique instance should be recognized, - heck, its natrual when you consider the business logic recognizes two seperate entities as seperate, regardless of their data - and when its so easy to just have an instance ID in that table - why would you ever want to cripple the flexibility later? I mean, you have to assume at least one column will never change and always be unique. Where is the advantage in that, when the client can very well change that factor down the road?
Admin
should read:
Personally, I consider artificial keys just a way to model a relevant fact in the data: That you are dealing with two possibly identical records.
Admin
I just wanted to add: It really isn't that difficult to cascade primary key changes in any modern database. (Even mysql now!) You change the key and you go on with your life without worrying about any of this crap. If you're working with other databases it simply takes a little planning and time to cascade them via a script, and these are supposed to be the values that hardly ever change, so it shouldn't be a big burden to change every few years or decades.
If your applications have their primary key baked in and recompiling/redistributing is out of the question, you're probably going to have maintenance nightmares long before the need to change natural primary keys comes in.
Admin
Both the natural and artificial keys are candidate keys; whichever one you don't mark as the primary key is, by definition, an alternate key. Even 5NF talks about "candidate keys", plural; how does this jibe with "alternate keys are always bad"?
If you're saying that changeable external data is not only a poor choice of primary key, but not a key at all (and anyone/thing that labels it as one is not only guilty of a poor choice, but is flat-out false), then that would seem to eliminate yet another big chunk of key arguments (albeit opening up another big chunk in its place).
Sorry, thinko on my part; I should have said that you can design new tables without having to remember or look up what types its FKs should have.
I never explicitly do any of this stuff. The real-world package that I'm using as a mental model, whatever it builds into the default install is plenty sufficient (one of my current projects joins something like ten tables and millions of records in thirty-ish seconds).
How do you define partitions? Does an artificial PK prevent you from partitioning on natural data? How do you determine whether adding a certain index will likely give enough of a performance gain to offset the performance cost of maintaining that index during table writes?
Admin
Nice flamebait!
Admin
*lol* In theory, yes, you can do that. In practice, it means updating millions of records, causing deadlocks, exploding your rollback segments, freezing the app (because of the locks) for minutes.
In theory, theory an practice are equal. In practice, theory and practice differ. Things that can change at all have a tendency to change more often than you think, at inconvinient times.
It's by far less effort to change all input/output routines that are affected by the structure of the "natural key" of a certain table than changing all other places where the primary key is used to make joins.
Admin
mmm. i don't really agree with that sentiment.
while you're correct to say that a seperate logial key is not needed for join / mapping tables the problem comes when the relationship between the 2 tables starts to mean something in itself.
if you ever need to create a child table or some other sort of link to the mapping row you end up having to use composite foreign keys, which just make the model horrible in my mind.
Admin
For those that don't believe in tables of dates, lets demo a quite common scenario.
You have a financial transaction history. A real simple one.
TransactionDateTime, Amount.
You can have a transactionId if it makes you happy.
What we need is an output going back to the date of formation of the company of the number of transactions and the value of the transactions broken down by year, month and day. Additionally we need a breakdown of activity by time of day in 10 minute chunks aggregated over any given period.
Due to some internal/external requirement we need to know if 0 events happened in any of those time periods/slots.
How will you code that?
Admin
The fact that the default install is good enough tells me that you are probably working with a database that is closer to medium or medium-large. I've done plenty of these system in my career, but I'm currently doing consulting for three of the five wall street banks. I'm not working with the actual trading systems (which have lots of special magic) but I am dealing with an ODS loaded nightly from the trading systems. This system grows by several tens of millions of rows nightly. These types of systems are large, but not huge. The really big systems handle airline reservations, for instance, and credit card transactions.
The artificial key does not prevent partitioning on natural data, but can mean that an additional attribute has to be carried into the table just for the purpose of partitioning. As far as maintence cost during table writes, partitioning helkps reduce this as well. The indexes can be maintained against individual partitions instead of the the entire table, so they are smaller and easier to maintain. The performance gain comes when most queries only scan a subset of indexes instead of the entire table.
Admin
select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
where a.tstfrom<=
Admin
select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
where a.tstfrom<=b.TransactionDatetime(+) and a.tsttill>b.TransactionDateTime(+) and a.tstfrom<sysdate
group by a.tstfrom
Admin
Paddy, you need to go back and read all of my posts about what the "problem" is with simply tagging ID's onto all your tables and ignoring data integrity.
Admin
Another in the "A primary key must be a single ID column" crowd. If you feel composite primary keys are a "horrible model", then (and I apologize for this, but it is true) you know absolutely nothing about relational databases.
You can make arguements about physical limitations of large keys, but to use the word "model" in the above statement is completely ignorant.
Admin
Ah! Now your perspective makes sense... you're an Oracle guy! So you don't normally need to deal with relational data or true joins, you have "rownum" and where clauses with + symbols in them!
[:)]
For those of us using relational databases, there are often huge advantages in having date tables or, more generically, number tables stored permanently for SELECT's like the one mentioned.
Admin
The given SQL-statemtent (incomplete, thanks to the ID-iotic forum software) is probably not the way I would do it in real world, just an example to show that it is possible. The (+) thingy in the where clause to indicate outer join kinda sucks, no question.
Anyway, the rownum pseudo column fits into the relational modell just as well as a number table; or the possibility to read only a part of a result set by exiting the loop, for the matter.
Admin
What loop? You lost me there.
Admin
The obvious solution here is a duel at dawn. My dad can beat up your dad.
I tend to agree with a previous poster who tags the initial WTF post as flamebait. Though it may be "technically" true that artificial IDs and "correct" relational database modeling are mutually exclusive, the tone of the post implies that those who, for sound practical reasons, employ "artificial" IDs are complete boobs. It smacks of snobbery and elitism. Sure, we all need to read up on the fundamentals, but if you can't take the time to summarize the factual points about why either method is good/bad, please don't stir up the mud. Then again, this is TDWTF, and perhaps not the place to help enlighten others - only to tell us how much we need enlightening.
As in everything, there is a "correct" way of doing something, and a "practical" way of doing something. I'm sure the roadways in my city are designed by REALLY SMART PEOPLE, using proven mathematical algorithms to test flow patterns and such. The problem is, nobody drives like the models say they should - so congestion just gets reappropriated to other parts of the city or interstate.
It's no different with real-world data and databases. The best designed systems often fall apart under everyday use for any number of reasons. So, maybe my databases are not "real" RDBs at all - but that doesn't mean that they are any less useful or meaningful.
Someone used a "world is flat/round" metaphor a while back. I don't think that analogy was fully correct. If nobody ever took the time to question the "authority" on the shape of the earth, we may well still believe it was flat.
Admin
select x from y where rownum<=10;
is mostly the same as opening a cursor for
select x from y
, processing the first 10 rows and exiting the loop then.In both cases, you are working on an randomly choosen subset of 10 rows, something that can hardly be expressed in terms of set operations, as relational theory requires.
Admin
OK, I'm lost ... you've demonstrated a worse way to write SELECT's rather than using a date or numbers table to make some sort of point, and then mentioned the advantages of "exiting a loop" by demonstrating a way to get 10 meaningless, semi-random rows from a table to make another. I am not sure what those points are, unfortunately ... Relational DB's are "bad"? Cursors are "good"? That you don't know about "SELECT TOP" ? This proves all tables must have an "ID" ? Tie it all together and let's here your thesis!
Admin
oops
Admin
I think I am going to have to disagree with you there. The definition of BCNF you cite states, "At this stage, all attributes are dependent on a key, a whole key and nothing but a key" (emphasis added by me).
Any given table can have any number of attributes or combinations of attributes which are sufficient to uniquely identify any row. These are called candidate keys, and from them we can select a primary key. However, all the candidate keys we did not select are still going to be in the table, and we could just as easily uniquely identify a row in the table using one of them as we could do the same using the primary key.
Adding an artificial key does not break BCNF as defined above. It merely adds another candidate key to the table.
I understand that wikipedia might not be the best relational database reference, so I'm open to the idea that the aforementioned definition of BCNF might not be correct. If you think it isn't, by all means, post a reference that says so; I'd be very interested in reading it.
Admin
I'm not a 3rd semester CS student talking out of my ass here. Account names are probably on the threshold, but ISBNs/UPCs, product codes or serial numbers, are things that simply don't change often. I could come up with others with a bit of morning stimulant. If you have a table with fields that change constantly (defined as "often enough that cascaded updates that cause a major performance impact can't be scheduled for downtimes") you'd be insane to make that the primary key.
I don't follow your third point, the way you phrased it is very confusing.
Admin
SELECT TOP doesn't exist in Oracle, but "ROWNUM<=..." in the where clause does mostly the same. (Some caveats concerning ORDER BY exist)
So, ROWNUM in Oracle can be used
- instead of SELECT TOP
- to generate numbers for joins (can be abused to generate dates, too)
and some other arcane things.
IMO this doesn't make Oracle less relational than other so called database systems.
From a puristic relational point of view, one might argue that cursors are bad.
We're OT here, since all this says nothing about whether or not IDs are evil.
Admin
Varchar doesn't waste space. I can't keep reading more posts harping on this without responding.
Admin
Let me clarify, varchar doesn't waste space for character data :).
Admin
Ok, this is a very interesting thread to me because I am NOT a database guy, though I have been called upon to design databases to support my applications. In these cases I always give it my best shot and then take it to someone I respect who has been around databases longer than I have to give constructive input. Actually, this is generally the same process I do with any design, but with databases it is more like approaching a teacher rather than a peer. I say all this to demonstrate I am very open and realize my knowledge is limited.
I've almost always used artificial keys, and none of the database people I've taken my designs to have ever balked at this. Which doesn't mean anything significant except that I am very surprised to read that some experienced developers apparently believe that not only are they bad, but that this is a fundamental issue in DB design. So I want to know more. I'll likely be re-designing a database here in a few months and I'd like to do a good job. Right now the database is VERY denormalized and has no foreign key constraints at all. I'm certain I can do better. But I'd really like to nail it and get it right, not just better.
So here's a very specific question from a real database I’ll be re-designing in the next six months: this application track bonds. Every bond is issued (externally) an identifier known as a Cusip. So clearly I will have a table for these bonds that contains the information. Now obviously a Cusip makes a nice primary key on the face of it, until you realize that sometimes Cusips are reused after a bond has matured and the two won’t be confused. So there you go- textbook case for a multi column natural key right? <!--[if !supportEmptyParas]-->You can set the maturity date and the Cusip as the primary key and you’re done.
So now I store both pieces of information in every table that I want to relate to this bond, correct? The Cusip and maturity date pair will proliferate through the data and my joins will now be on both fields. Not what I’m used to, but I can handle that if it’s going to make my data integrity stronger. But what do I do when the user suddenly realizes that the Cusip or maturity date was entered wrong and this error has been proliferated throughout the entire database? My integrity is intact, but the data is very wrong and fixing it is going to take quite a bit of error-prone updates to the database. The more relations there are to this table, the more difficult it is going to be to make sure they are all changed correctly. I suppose I can just force the users to enter new records and do all of the updates by hand, but now I’m putting constraints on the user that are difficult to explain, especially when I’m not certain myself. But if I only stored a reference to the data (an identity column) rather than proliferate the errors throughout the database, the data is easy and safe to correct. So how to you (the the natural key camp) handle this situation which seems textbook made for a natural key?
On a side note, what is this about Stored Procs being bad for integrity? <!--[if !supportEmptyParas]-->My manager has just deemed that almost everything we do should be in Stored Procs, so if there is some good information on why this is a bad thing (or questionable at least), it’d be good to know. <!--[endif]--><o:p></o:p>
I realize that a lot of people are tired of this discussion and feel like it’s “just another religious debate”, but I’m very interested in the perspectives of those with a lot more database experience than I.
Admin
First to address Jeff's post:
I didn't mean for that to sound accusatory or anything, so I apologize if it did. I was just interested in your take on my proposed changes to your design.
RevMike:
I'm not sure if you were addressing Jeff's design or mine, so I'm sorry if I'm missing the point.
In my proposed design, you'd have an address table and a table linking people to addresses. That table would have PersonName, AddressID, and AddressType columns. If Joe, Sue, Bob, and Bubba all live at the address with an AddressID of 123, then our table would have the following 4 rows in it:
Joe, 123, Home
Sue, 123, Home
Bob, 123, Home
Bubba, 123, Home
If Bob moves to another address, say the address with an AddressID of 456, then all you need to do is change Bob's row in our table so that we have:
Joe, 123, Home
Sue, 123, Home
Bob, 456, Home
Bubba, 123, Home
Of course, this will potentially leave you with addresses in the address table that aren't tied to anybody, and that's obviously not ideal. Anyway, how else is this needlessly complicating the system, or am I just misunderstanding you? If so, please let me know.
Admin
Without knowing more about the specifics of what you are storing, and how often these dates are "wrong" and need to be adjusted, it is hard to be able to give a good data model. That is the problem many people face with designing a db -- they come up with ideas or make decisions before considering carefully the whole of the system.
Keep in mind that in most modern RDMS's, foreign keys can cascade and automatically be updated with any changes in the primary key. Not sure about mySQL and Oracle, but SQL Server >=2000 does this. (In fact, even good old Access does this). So, if you have proper integrity set up with cascading updates, it should not be a problem.
It comes down to this: for two bonds with the same CuspID, are they related? Is there any connection between the two, other than being assigned the same number externally?
If yes, then you should have a table of CuspID's and all data that common amongst bonds with the same CuspID stored in it. Then, you would have a child table of the Bonds themselves with a pk of CuspID/Maturity Date.
IF the answer is no, there is no connection, then I would have a table of Bonds, unique constraint on CuspID/MaturityDate, and then probably add a BondID surrogate and let that identify each bond. Since two bonds with the same CuspID have no relation, there is no need for the two bonds to share a common parent in the key that uniquely identifies them.
See the difference?
Admin
ISBNs don't change. UPC/EAN/GTIN are a different kind of beast. But that's too off-topic to go into detail. Enough to say that there is a reason why each and every business maintains their own product number instead of relying on UPC.
The "often enough" thingy obviously depends on scheduled downtimes. Unfortunately, downtimes get shorter this times. Sooner or later the company starts selling through the internet and want its systems run 24/7. No more scheduled downtimes...
My third point says: If you build your system with "Product_number" as the primary key of the product table and join all other tables (stock items, order items, transaction journal etc.) to the product table on the "Product_number" column, it hurts badly when - after some reorganisation, company buys other company etc. - "Product_number" is no longer unique, but you have to use "Company_number"+"Product_number" instead. This literally fscks up your whole database and probably large parts of the program. On the other hand, if you've used that unloved ID column, you have to revise all programs that read or write the product number, so they also read resp. write the company number (where necessary); but other parts of the program are not affected.
Admin
You can safely ignore that comment; in fact, based on that statement, you might be wise to ignore any comments made by whoever wrote that.
Admin
Are you kidding? You can only have a composite foreign key if you have a composite primary key. So you are essentially saying that any model with compositie primary keys is horrible. You are ignorant.
Admin
My sentiments exactly... I couldn't help but laugh at that myself.
Admin
It most certainly can. Not necessarily will.
Consider your typical 1 character code. Shall we have that a char(1) or a varchar(1). Or indeed a varchar(50) with a constraint restricting it to a single character. Usually you will find that adding a variable length character field has all sorts of repercussions. In M$ SQL Server iirc it takes 12 bytes per row to add one or more varchar column to a table, and an additional 1 or 2 per field per row. So now the choice to use a varchar(1) instead of a char(1) on a table that is quite narrow to start with has some serious implications.
Admin
You know full well I wasn't addressing char(1) vs varchar(1). That wasn't the context of the comment.
Admin
Some details- this will be SQL Server 2000 so cascading is available. The Cusips and maturity date are entered by hand, so I always have to assume they could be wrong. Two bonds with the same Cusip are NOT related- it's just a simple example of id re-use over time (much like SSNs). A child table would make no sense.
Admin
I wouldn't be so sure there. Theoretically, different editions (such as hardcover/paperback) of the same book must have different ISBNs (this may already qualify as "change", depending on your needs), while different printings of the same edition must use the same ISBN even though they may not be identical (e.g. price, errata), and ISBNs must never be reused.
In practice, the difference between edition and printing may not be totally clear or get disregarded, and I've even heard that there have been cases where ISBNs were reused for completely different books.
Admin
This might possibly be the best reason I can think of to not use natural keys(in general of course). At some point, a user must input the value for the natural key, and we all know how stupid users are. No matter what we do to validate input(double-key, regex, etc.) users will still be able to enter incorrect values, and depending on the number of users, it could be fairly frequently.
If we use surrogate keys, we will find that correction of invalid input becomes a lot easier because we don't have to worry about the foreign key in another table when we fix the invalid input. If the user enters "HORE", and this value is used as a foreign key in child tables, a cascade update might possibly be troublesome. But if the value he enters is simply for a column with a unique constraint with a surrogate key representing that row, he can go back and easily change it to "HORSE"(what did you think it was? sicko!) without worrying about the possible troubles involved in a cascade update.
Of course, let me qualify all my statements as a person who believes surrogate keys are preferred, not required for all tables, yadda yadda yadda....
Admin
Well, then, there you go. The key to this (no pun intended) is that the two cuspid are not related. Therefore, you would not want a composite key, since they would have no common parent.
Sounds like a surrogate might be a great choice here (in addition to that unique constraint on cuspid/maturation date -- don't forget data integrity!) , unless someone else about the Bond serves as a nice natural pk.
Admin
Part of the problem with this, aside from the orphan row problem, is that every address change involves a lookup of all addresses to determine whether there is a matching record already. The logic would need to create the new address record or simply update the relationship to the existing record. This relatively expensive search needs to be serialized, otherwise two users could simultaneously enter duplicate addresses. Now your overhead is way up and throughput is limited.
I was actually thinking about it in the opposite way. Rather than create a new address we could update the existing address record.
When I model, I don't like to reduce coincidental relationships. There is nothing fundamentally wrong with doing so, I just don't like to. From your example I inferred that Joe, Sue, Bob, and Bubba are four different people with four distinct relationships that we are trying to track. I don't see a good purpose in reducing them to a single detail. If the relationship was more than coincidental, however, I have no problem with them sharing an address through an intermediate table describing the relationship, e.g. a Household table.
As I think about it more, I'd really like to put the address type on the address table itself. I don't like having the Many-To-Many table there. I think this is more intuitive....
Person
PersonID, PersonName, BirthDate
1, Joe, 1/1/1970
2, Sue, 1/2/1971
3, Bob, 1/3/1972
4, Bubba, 1/4/1973
PersonAddress
PersonID, AddressType, Street, City, State, Zip
1, Home, 123 Main St, Anywhere, NY, 11223
2, Home, 123 Main St, Anywhere, NY, 11223
3, Home, 123 Main St, Anywhere, NY, 11223
4, Home, 123 Main St, Anywhere, NY, 11223
Now, to update Bob's home address, I just update to this...
PersonAddress
PersonID, AddressType, Street, City, State, Zip
1, Home, 123 Main St, Anywhere, NY, 11223
2, Home, 123 Main St, Anywhere, NY, 11223
3, Home, 999 Broadway, Anywhere, CA, 99999
4, Home, 123 Main St, Anywhere, NY, 11223
Note that this update doesn't require any large searchs and happens cleanly inside a very small transaction. If my platform support row level locking, other updates to other records happen simultaneously.
Depending on the application and platform I may or may not create a domain table for address type, potentially using a constraint or a user defined type to limit the values. I'm not religious about this for an extremely small set of values. In this case I'm thinking the AddressTypes might be limited to Home, Office, and Other. If I do make the AddressType a foreign key reationship, I'd likely use a small character field rather than an ID. It is admittedly a crutch, but one which eases the work of maintainers. I don't mind compromising normal form, I mind when normal from is compromised without consideration of the impact.
Admin
Depending on your platform and version, this choice can have a dramatic effect on performance. Oracle has handled VARCHAR types very cleanly since 7.0. Sybase only got good varchar handling in version 12. (I'm not sure where SqlServer sits today.)
Oracle early on allowed variable length records within a database block. Varchars are stored within the row. Older Sybase, however, didn't store the varchar in the row, but only a pointer to the Varchar. The data itself was in another block. So reading an Oracle row took only one read operation, while Sybase would potentially perform that read, then go back to disk and do a second read.
Admin
I'm not sure I see why this would be a huge problem. If a record labeled "HORE" wa entered incorrectly, the user would simply create a new record labeled "HORSE" and go on without a problem. This maintains history. (I have a bias, having spent 5 years maintaining an accounting system that never had a delete.) :)
Admin
The question to ask is: "Are addresses attributes or entities?"
If an address is just an ATTRIBUTE of an entity, just info you just want to store for a customer, then you either a) put it on the customer table or b) create a child addresses table of customers so you can store many addresses (1 per type, of course; pk = customer/AddressType). That situation is similiar to what RevMike is describing.
If an address is an ENTITY, something that you want to keep track of, something that you want other entities in your database to potentially share, then you have a separate Addresses table (pk of addressID if you want, be sure to add unqiue constraints) and you have a situation similiar to what Uncle M has described where you relate entities to addresses. Either each customer has an AddressID column in the table (for 1 address per customer), or you store multiple addresses per customer in a table with a pk of (Customer/AddressID/Type).
See the difference? It depends on your needs; both situations are correct.
Consider a phone number: the area code is shared among phone numbers. Should you create an area codes table? You do the same test: if AreaCode is an entity you need to track (i.e., to store the state or area or long distance info) then you have a table of AreaCodes (hopefully, with a pk of AreaCode, not an AreaCodeID identity!) and you relate. But if you are only storing the area code as an attribute, then you just have a column called areacode in your tables and you fill it in and don't worry about 30 phone numbers having "duplicate" information.
Admin
You make a great point, Jeff S.
I think the greater number of systems out there consider address as an attribute of the customer or employee entity. However, it isn't hard to think of systems that would use address as an entity in an of itself. Immediately I can think of a "ServiceLocation" table for a utility company or a Property table for a proeprty tax authority.
Admin
Yes! In fact, I have found that addresses themselves are almost never entities, but rather attributes of locations, like you mention. Especially since a location can have multiple addresses (shipping, mailing, physical), etc. That's a great point and something I should have mentioned: if you think an address is an entity, then you are most likely wrong and you want to track locations with (potentially multiple) address attributes.
Admin
Let me try to rephrase this as a general principal, and please let me know if I get it right. If a row in a database can be identified based on a single column, then that column should be used as your primary key. If there are multiple columns then you might make a compisite key, depending on whether or not rows with a matching value for part of the key would share some data. If they would not, then an artifical key makes more sense (yet still putting the contstraint on your data to keep from duplicating rows you shouldn't).
If I understand this correctly then wouldn't at least part of each of your composite keys always exist in a parent table as a primary key?
Also, is there a good book on DB design that I should make part of my reading over the next few months? Something long on practical with enough theory so I'd know the "whys" of good db design? I don't intend to ever be a db programmer, but I like to know enough about whatever I touch that I don't create more problems than I solve.