• (cs) in reply to RevMike
    RevMike:
    I do say that using an artificial primary key when there is a valid natural key violates third normal form.

    Technically true (at least in my understanding) but mostly irrelevant, since the usual problems with transitive dependencies do not apply in this case. This is because you will make sure that the "natural key" is unique, even if you don't use it as your primary key for joins.
  • (cs) in reply to Jeff S
    Jeff S:

    UncleMidriff:

    Now, what if Joe, Sue, Bubba, and Bob all live together and join your club?  You'd have four rows in your address table all with the exact same data save for the Customer foreign key.  While this would work well, and I can't foresee it causing any problems, it bugs me that we're storing the exact same address more than once.  I might propose that we move all the address info to a separate table, append some kind of AddressID to it, and then add a reference to that AddressID to our first table. That way, all the same constraints are enforced, but instead of repeating the entire address 4 times, we're only repeating the small AddressID value 4 times.  What do you think?  Have I gone off the deep end?

    That's a great modelling question; there are many ways to handle it, and it can be a good discussion.  Unfortunately, I don't see where a single word I've written disallows this; that is very different from the storage-of-addresses scenario I mentioned earlier. If each entity can have exactly one address, but some entities can share addresses, then indeed creating an address entity might be the way to go, and adding a surrogate key to that table would probably be a great idea, instead of having the key be all columns of course.  The situation I referred to was where a single entity can have multiple addresses, and ways to model that situation.

    I will repeat (yet again) that there are many situations in which surrogate keys are very useful. 

    Maybe I should just add that to my signature?  It will save some typing. 



    I prefer to model this as a separate address entry for each of Joe, Sue, Bob, and Bubba.  The fundamental problem with modeling them as a single address is that were Bob to move out, your software would have to detect that the address is shared and modify the data for Bob while leaving Joe, Sue, and Bubba at the same address.  This complicates your system needlessly.

    On the other hand, if Joe, Sue, Bob, and Bubba were linked to an intermediate entity that then had one address, I see no problem with them sharing that address.  So if Joe, Sue, Bob, and Bubba were employees who worked at the same bank branch, they could be tied to that branch and then the branch could be tied to the address.  Now if the branch were to move down the road, all the branch employees would automatically move with it.  However, if Bob were to transfer, he would be tied to a new branch.
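
    To make that concrete, something like this (just a sketch, with hypothetical table names):

    create table address (
        address_id int primary key,
        street     varchar(100) not null,
        city       varchar(50) not null,
        state      char(2) not null
    );

    create table branch (
        branch_id  int primary key,
        address_id int not null references address(address_id)
    );

    create table employee (
        emp_name  varchar(100) primary key,
        branch_id int not null references branch(branch_id)
    );

    -- The branch moves down the road: one row changes, every employee follows.
    update branch set address_id = 456 where branch_id = 1;

    -- Bob transfers: only his row changes.
    update employee set branch_id = 2 where emp_name = 'Bob';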
  • (cs) in reply to Just Another WTF
    Just Another WTF:

    I have no problem with using a Customer Number that is actually part of the customer's data.  If that customer number appears on his invoices... if the customer can identify himself to your company by that number then the datum is real and is a perfectly reasonable thing to use as a primary key.  As Jeff has been pointing out, he is not against artificially created primary keys if they have meaning and are part of the business model.  GL account numbers, Customer numbers, company account numbers, credit card numbers, SSNs, and EINs are all examples of these sorts of keys, but you also notice that they have taken on real meaning in the real world.  What we are arguing against are number sequences that exist in the database and the application code that never reach the outside world and have no real reason to exist.  They make the SQL meaningless; they make the application code meaningless.  Since you basically have to create a key on the natural key in the DB anyway, to preserve the integrity of the data, why waste DB resources creating and storing an extra column and the indexing structures needed to support the artificial, meaningless key?



    Because GL account numbers, Customer numbers, etc etc etc are subject to change (provably; I have seen both of 'em actually change at multiple clients for multiple good reasons).

    RevMike:

    If the data already contains a true natural key, then adding an artificial key breaks 3NF (or perhaps BCNF, I don't remember which).  A row should be uniquely identified by one and only one key.


    You're claiming that no table should have any alternate keys? I don't believe any of the NFs claim that. I'm using Wikipedia's article on "database normalization" as a reference here - if you have a cite to the contrary, please post a link to it.

    Even if it does break one of the NFs, it has practical value anyway, because the natural key that you would intuitively pick as the primary key is so often subject to change. Also, if you make a consistent practice of using artificial keys of type int (or bigint, if any of your tables might ever reasonably get big enough to need it), then you can write table joins without having to remember or look up what type each table's key is.
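
    For instance (a sketch, hypothetical tables): if every PK is an int surrogate by convention, every FK is an int, and there's nothing to look up:

    create table customer (
        customer_id int primary key,       -- by convention, always int
        name        varchar(100) not null
    );

    create table orders (
        order_id    int primary key,
        customer_id int not null references customer(customer_id)
        -- no need to check whether the parent's key was char(10), varchar(20), ...
    );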
  • (cs) in reply to emurphy
    emurphy:

    RevMike:

    If the data already contains a true natural key, then adding an artificial key breaks 3NF (or perhaps BCNF, I don't remember which).  A row should be uniquely identified by one and only one key.


    You're claiming that no table should have any alternate keys? I don't believe any of the NFs claim that. I'm using Wikipedia's article on "database normalization" as a reference here - if you have a cite to the contrary, please post a link to it.


    I don't need a link to the contrary.  Your own link has the reference I need.  Please look at the section on BCNF.

    wikipedia on Database normalization:

    Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.


    If there are both natural and artificial keys, then attributes can be expressed as dependencies on either key.  Therefore, by your own reference article, a table should not have alternate keys.

    emurphy:

    Even if it does break one of the NFs, it has practical value anyway, because the natural key that you would intuitively pick as the primary key is so often subject to change.


    I've already pointed out in a prior post (probably page 2) the danger of choosing a natural "near-key" as a key.  My particular bad experience has been with using SSN as a natural key for person.  It works just well enough to become really entrenched in a major system before it bites you in the ass.

    emurphy:
    Also, if you make a consistent practice of using artificial keys of type int (or bigint, if any of your tables might ever reasonably get big enough to need it), then you can write table joins without having to remember or look up what type each table's key is.


    This I don't understand.  Why do I need to know the type of a key column in order to write a join?  I've never seen that in any SQL I've ever written.  Of course some idiot could make a foreign key of a different type, but that is a WTF itself.  Please provide me with an example of a major SQL dialect (DB2 (all three flavors), Oracle, Sybase, SqlServer, MySql, Postgres, Informix, Teradata) where the type of a column is important when writing a join.
  • (cs) in reply to emurphy

    Sorry to reply twice.  I hit post, then remembered the other half of my counter argument...

    emurphy:
    RevMike:

    If the data already contains a true natural key, then adding an artificial key breaks 3NF (or perhaps BCNF, I don't remember which).  A row should be uniquely identified by one and only one key.



    You're claiming that no table should have any alternate keys?  --snipped--

    Even if it does break one of the NFs, it has practical value anyway, because the natural key that you would intuitively pick as the primary key is so often subject to change. Also, if you make a consistent practice of using artificial keys of type int (or bigint, if any of your tables might ever reasonably get big enough to need it), then you can write table joins without having to remember or look up what type each table's key is.


    When you are dealing with sufficiently large tables, you in fact don't want to use an artificial key when a suitable natural key is available.  A natural key, or more likely a component of a natural composite key, is likely to provide you with the predicate you need to do reasonable partitioning.  Partitioning on an artificial key is pretty useless, since that column is already indexed and is only referenced to perform joins anyway.  On the other hand, a natural key can help your database engine narrow a search to a partition very quickly.

    As an example, say you have a transaction table which has a composite key of account, transaction date, and transaction sequence number.  If you need to search for a particular set of transactions, and you have a transaction date range, your artificially keyed table will need to do a full scan, or you'll need to add a transaction date index.  Since there may be hundreds of thousands of transactions per transaction date, the transaction date index offers some help, but still you'll resort to scanning lots of data because either you need to merge those rows with results from other indexes or scan a large number of blocks.

    On the other hand, if you partition on transaction date, your engine will first select the appropriate partition(s), and then search only the indexes in that partition.  Your performance goes up dramatically.
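
    In Oracle-ish syntax, the idea looks roughly like this (hypothetical names, untested sketch):

    create table transactions (
        account_id int           not null,
        txn_date   date          not null,
        txn_seq    int           not null,
        amount     numeric(12,2) not null,
        primary key (account_id, txn_date, txn_seq)
    )
    partition by range (txn_date) (
        partition p2004 values less than (date '2005-01-01'),
        partition p2005 values less than (date '2006-01-01')
    );

    -- A date-range predicate lets the engine prune straight to the relevant
    -- partition(s) before it touches a single index:
    select count(*), sum(amount)
      from transactions
     where txn_date between date '2005-03-01' and date '2005-03-31';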
  • (cs) in reply to Jeff S

    For me, I am generally suspicious of natural keys, because they make assumptions that generally can be wrong, or can change.

    The biggest concerns are that:

    1. almost all the time, all columns of data can change
    2. sometimes, duplicate rows should exist.
    3. Clients will often ask for changes that make one's head explode

    Let's say you print an address, and you want 'CA' for the state to show up - and quite nicely, you've used the abbreviation as your natural key for your state table, and you don't even have to do a JOIN to print the state in the address.

    Then, the client asks for something retarded, like 'Ca' instead of 'CA' to be printed up because he thinks all caps looks unprofessional.  Never underestimate the sorts of things clients will do or ask for.

    Personally, I consider artificial keys just a way to model a relevant fact in the data:  That you are dealing with two identical records. 

    Take
    int a = 0;
    int b = 0;
    They are not the same - only their data is the same.  If you wanted to model that in a database you'd need something like:
    table ints:
    memoryspace    value
    1              0
    2              0

    To me, it seems like natural keys are an unnatural way to get around the fact that you are dealing with separate entities - trying to make the data values reflect the row's unique purpose.  That's nice, but I like to think of a customer row as an instance of a customer record, and therefore the database should recognize the fact that it is a unique instance - heck, it's natural when you consider that the business logic recognizes two separate entities as separate, regardless of their data - and when it's so easy to just have an instance ID in that table, why would you ever want to cripple the flexibility later?  I mean, you have to assume at least one column will never change and always be unique.  Where is the advantage in that, when the client can very well change that factor down the road?
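
    Or in table terms (hypothetical sketch):

    create table customer (
        customer_id int primary key,   -- identifies the instance, not the data
        name        varchar(100),
        city        varchar(50)
    );

    insert into customer values (1, 'John Smith', 'Springfield');
    insert into customer values (2, 'John Smith', 'Springfield');  -- a second, distinct John Smith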
  • (cs) in reply to paddy
    Personally, I consider artificial keys just a way to model a relevant fact in the data:  That you are dealing with two identical records.

    should read:

    Personally, I consider artificial keys just a way to model a relevant fact in the data:  That you are dealing with two possibly identical records. 
  • (cs) in reply to paddy

    I just wanted to add: It really isn't that difficult to cascade primary key changes in any modern database. (Even mysql now!) You change the key and you go on with your life without worrying about any of this crap. If you're working with other databases it simply takes a little planning and time to cascade them via a script, and these are supposed to be the values that hardly ever change, so it shouldn't be a big burden to change every few years or decades.

    If your applications have their primary key baked in and recompiling/redistributing is out of the question, you're probably going to have maintenance nightmares long before the need to change natural primary keys comes in.
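
    Where the engine supports it (SQL Server does, MySQL/InnoDB does now; Oracle notably has no ON UPDATE CASCADE), one declaration is all it takes - hypothetical names:

    create table product (
        product_code varchar(20) primary key
    );

    create table order_item (
        order_id     int not null,
        product_code varchar(20) not null,
        foreign key (product_code) references product(product_code)
            on update cascade
    );

    -- Change the key once; every referencing row follows automatically.
    update product set product_code = 'NEW-CODE' where product_code = 'OLD-CODE';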

  • (cs) in reply to RevMike
    RevMike:
    emurphy:

    RevMike:

    If the data already contains a true natural key, then adding an artificial key breaks 3NF (or perhaps BCNF, I don't remember which).  A row should be uniquely identified by one and only one key.


    You're claiming that no table should have any alternate keys? I don't believe any of the NFs claim that. I'm using Wikipedia's article on "database normalization" as a reference here - if you have a cite to the contrary, please post a link to it.


    I don't need a link to the contrary.  Your own link has the reference I need.  Please look at the section on BCNF.

    wikipedia on Database normalization:

    Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.


    If there are both natural and artificial keys, then attributes can be expressed as dependencies on either key.  Therefore, by your own reference article, a table should not have alternate keys.


    Both the natural and artificial keys are candidate keys; whichever one you don't mark as the primary key is, by definition, an alternate key.  Even 5NF talks about "candidate keys", plural; how does this jibe with "alternate keys are always bad"?

    RevMike:

    emurphy:

    Even if it does break one of the NFs, it has practical value anyway, because the natural key that you would intuitively pick as the primary key is so often subject to change.


    I've already pointed out in a prior post (probably page 2) the danger of choosing a natural "near-key" as a key.  My particular bad experience has been with using SSN as a natural key for person.  It works just well enough to become really entrenched in a major system before it bites you in the ass.


    If you're saying that changeable external data is not only a poor choice of primary key, but not a key at all (and anyone/thing that labels it as one is not only guilty of a poor choice, but is flat-out false), then that would seem to eliminate yet another big chunk of key arguments (albeit opening up another big chunk in its place).

    RevMike:

    emurphy:
    Also, if you make a consistent practice of using artificial keys of type int (or bigint, if any of your tables might ever reasonably get big enough to need it), then you can write table joins without having to remember or look up what type each table's key is.


    This I don't understand.  Why do I need to know the type of a key column in order to write a join?


    Sorry, thinko on my part; I should have said that you can design new tables without having to remember or look up what type their FKs should be.

    RevMike:

    When you are dealing with sufficiently large tables, you in fact don't want to use an artificial key when a suitable natural key is available.  A natural key, or more likely a component of a natural composite key, is likely to provide you with the predicate you need to do reasonable partitioning.  Partitioning on an artificial key is pretty useless, since that column is already indexed and is only referenced to perform joins anyway.  On the other hand, a natural key can help your database engine narrow a search to a partition very quickly.


    I never explicitly do any of this stuff.  With the real-world package that I'm using as a mental model, whatever it builds into the default install is plenty sufficient (one of my current projects joins something like ten tables and millions of records in thirty-ish seconds).

    How do you define partitions?  Does an artificial PK prevent you from partitioning on natural data?  How do you determine whether adding a certain index will likely give enough of a performance gain to offset the performance cost of maintaining that index during table writes?

  • giannis (unregistered)
    Alex Papadimoulis:

    There are those who believe that each and every table in a relational database must have a "uniquifier" Primary Key (IDENTITY, SEQUENCE, GUID, and so on) and there are those who actually know how to design a relational database. As Dave points out, the trouble comes when the former group is actually allowed to design databases.


    Nice flamebait!

  • (cs) in reply to foxyshadis
    foxyshadis:
    I just wanted to add: It really isn't that difficult to cascade primary key changes in any modern database. (Even mysql now!) You change the key and you go on with your life without worrying about any of this crap.

    *lol* In theory, yes, you can do that. In practice, it means updating millions of records, causing deadlocks, exploding your rollback segments, freezing the app (because of the locks) for minutes.

    If you're working with other databases it simply takes a little planning and time to cascade them via a script, and these are supposed to be the values that hardly ever change, so it shouldn't be a big burden to change every few years or decades.

    In theory, theory and practice are equal. In practice, theory and practice differ. Things that can change at all have a tendency to change more often than you think, at inconvenient times.

    If your applications have their primary key baked in and recompiling/redistributing is out of the question, you're probably going to have maintenance nightmares long before the need to change natural primary keys comes in.

    It's by far less effort to change all the input/output routines that are affected by the structure of the "natural key" of a certain table than to change all the other places where the primary key is used to make joins.
  • zedenne (unregistered) in reply to Brad

    mmm. i don't really agree with that sentiment.

    while you're correct to say that a separate logical key is not needed for join / mapping tables, the problem comes when the relationship between the 2 tables starts to mean something in itself.

    if you ever need to create a child table or some other sort of link to the mapping row you end up having to use composite foreign keys, which just make the model horrible in my mind.
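
    for illustration, the kind of thing i mean (hypothetical tables):

    create table student_course (
        student_id int not null,
        course_id  int not null,
        primary key (student_id, course_id)
    );

    -- a child of the mapping row has to carry the whole composite key around:
    create table student_course_grade (
        student_id int not null,
        course_id  int not null,
        grade      char(1) not null,
        foreign key (student_id, course_id)
            references student_course (student_id, course_id)
    );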

  • wang (unregistered) in reply to zedenne

    For those that don't believe in tables of dates, let's demo a quite common scenario.

     

    You have a financial transaction history.  A real simple one.

    TransactionDateTime, Amount.

    You can have a transactionId if it makes you happy.

     

    What we need is an output going back to the date of formation of the company of the number of transactions and the value of the transactions broken down by year, month and day.  Additionally we need a breakdown of activity by time of day in 10 minute chunks aggregated over any given period. 

    Due to some internal/external requirement we need to know if 0 events happened in any of those time periods/slots.

     

    How will you code that?

     

  • (cs) in reply to emurphy
    emurphy:
    RevMike:

    When you are dealing with sufficiently large tables, you in fact don't want to use an artificial key when a suitable natural key is available.  A natural key, or more likely a component of a natural composite key, is likely to provide you with the predicate you need to do reasonable partitioning.  Partitioning on an artificial key is pretty useless, since that column is already indexed and is only referenced to perform joins anyway.  On the other hand, a natural key can help your database engine narrow a search to a partition very quickly.


    I never explicitly do any of this stuff.  With the real-world package that I'm using as a mental model, whatever it builds into the default install is plenty sufficient (one of my current projects joins something like ten tables and millions of records in thirty-ish seconds).


    The fact that the default install is good enough tells me that you are probably working with a database that is closer to medium or medium-large.  I've done plenty of these systems in my career, but I'm currently doing consulting for three of the five Wall Street banks.  I'm not working with the actual trading systems (which have lots of special magic) but I am dealing with an ODS loaded nightly from the trading systems.  This system grows by several tens of millions of rows nightly.  These types of systems are large, but not huge.  The really big systems handle airline reservations, for instance, and credit card transactions.

    emurphy:

    How do you define partitions?  Does an artificial PK prevent you from partitioning on natural data?  How do you determine whether adding a certain index will likely give enough of a performance gain to offset the performance cost of maintaining that index during table writes?


    The artificial key does not prevent partitioning on natural data, but can mean that an additional attribute has to be carried into the table just for the purpose of partitioning.  As for maintenance cost during table writes, partitioning helps reduce this as well.  The indexes can be maintained against individual partitions instead of the entire table, so they are smaller and easier to maintain.  The performance gain comes when most queries only scan a subset of indexes instead of the entire table.
  • (cs) in reply to wang
    Anonymous:

    For those that don't believe in tables of dates, let's demo a quite common scenario.

     

    You have a financial transaction history.  A real simple one.

    TransactionDateTime, Amount.

    You can have a transactionId if it makes you happy.

     

    What we need is an output going back to the date of formation of the company of the number of transactions and the value of the transactions broken down by year, month and day.  Additionally we need a breakdown of activity by time of day in 10 minute chunks aggregated over any given period. 

    Due to some internal/external requirement we need to know if 0 events happened in any of those time periods/slots.

     

    How will you code that?

     



    select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
      from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
      where a.tstfrom<=
          
  • (cs) in reply to ammoQ

    select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
      from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
      where a.tstfrom<=b.TransactionDatetime(+) and a.tsttill>b.TransactionDateTime(+) and a.tstfrom<sysdate
      group by a.tstfrom

  • (cs) in reply to paddy

    paddy:
    Personally, I consider artificial keys just a way to model a relevant fact in the data:  That you are dealing with two identical records.

    should read:

    Personally, I consider artificial keys just a way to model a relevant fact in the data:  That you are dealing with two possibly identical records. 

    Paddy, you need to go back and read all of my posts about what the "problem" is with simply tagging ID's onto all your tables and ignoring data integrity.

  • (cs) in reply to zedenne

    Anonymous:
    mmm. i don't really agree with that sentiment.

    while you're correct to say that a separate logical key is not needed for join / mapping tables, the problem comes when the relationship between the 2 tables starts to mean something in itself.

    if you ever need to create a child table or some other sort of link to the mapping row you end up having to use composite foreign keys, which just make the model horrible in my mind.

    Another in the "A primary key must be a single ID column" crowd.  If you feel composite primary keys are a "horrible model", then (and I apologize for this, but it is true) you know absolutely nothing about relational databases.

    You can make arguments about physical limitations of large keys, but to use the word "model" in the above statement is completely ignorant.

  • (cs) in reply to ammoQ

    ammoQ:
    select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
      from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
      where a.tstfrom<=b.TransactionDatetime(+) and a.tsttill>b.TransactionDateTime(+) and a.tstfrom<sysdate
      group by a.tstfrom

    Ah! Now your perspective makes sense... you're an Oracle guy! So you don't normally need to deal with relational data or true joins, you have "rownum" and where clauses with + symbols in them!

    [:)]

    For those of us using relational databases, there are often huge advantages in having date tables or, more generically, number tables stored permanently for SELECT's like the one mentioned. 
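
    For example, with a permanent numbers table, wang's report becomes a single outer join (T-SQL sketch, untested; assumes a hypothetical numbers table populated with 0, 1, 2, ...):

    declare @formation_date datetime;
    set @formation_date = '2000-01-01';

    select dateadd(minute, 10 * n.num, @formation_date) as slot_start,
           count(t.transactionId)                       as txn_count,
           coalesce(sum(t.amount), 0)                   as total
      from numbers n
      left join transactions t
        on t.TransactionDateTime >= dateadd(minute, 10 * n.num, @formation_date)
       and t.TransactionDateTime <  dateadd(minute, 10 * (n.num + 1), @formation_date)
     where dateadd(minute, 10 * n.num, @formation_date) < getdate()
     group by dateadd(minute, 10 * n.num, @formation_date);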

  • (cs) in reply to Jeff S
    Jeff S:

    ammoQ:
    select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
      from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill from all_objects, all_objects) a, transactions b
      where a.tstfrom<=b.TransactionDatetime(+) and a.tsttill>b.TransactionDateTime(+) and a.tstfrom<sysdate
      group by a.tstfrom

    Ah! Now your perspective makes sense... you're an Oracle guy! So you don't normally need to deal with relational data or true joins, you have "rownum" and where clauses with + symbols in them!

    [:)]

    For those of us using relational databases, there are often huge advantages in having date tables or, more generically, number tables stored permanently for SELECT's like the one mentioned. 



    The given SQL statement (incomplete, thanks to the ID-iotic forum software) is probably not the way I would do it in the real world, just an example to show that it is possible. The (+) thingy in the where clause to indicate outer join kinda sucks, no question.
    Anyway, the rownum pseudocolumn fits into the relational model just as well as a number table does; or the possibility to read only a part of a result set by exiting the loop, for that matter.
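
    For the record, the same thing in ANSI join syntax (available since Oracle 9i; untested, same caveats as the original):

    select to_char(a.tstfrom, 'YYYY-MM-DD HH24:MI'), count(b.transactionId), nvl(sum(b.amount),0)
      from (select :formationdate+rownum/144 tstfrom, :formationdate+(rownum+1)/144 tsttill
              from all_objects, all_objects) a
      left join transactions b
        on a.tstfrom<=b.TransactionDateTime and a.tsttill>b.TransactionDateTime
      where a.tstfrom<sysdate
      group by a.tstfrom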
  • (cs) in reply to ammoQ

    ammoQ:
    ... or the possibility to read only a part of a result set by exiting the loop, for that matter.

    What loop?  You lost me there. 

  • (cs)

    The obvious solution here is a duel at dawn. My dad can beat up your dad.

    I tend to agree with a previous poster who tags the initial WTF post as flamebait. Though it may be "technically" true that artificial IDs and "correct" relational database modeling are mutually exclusive, the tone of the post implies that those who, for sound practical reasons, employ "artificial" IDs are complete boobs. It smacks of snobbery and elitism. Sure, we all need to read up on the fundamentals, but if you can't take the time to summarize the factual points about why either method is good/bad, please don't stir up the mud. Then again, this is TDWTF, and perhaps not the place to help enlighten others - only to tell us how much we need enlightening.

    As in everything, there is a "correct" way of doing something, and a "practical" way of doing something. I'm sure the roadways in my city are designed by REALLY SMART PEOPLE, using proven mathematical algorithms to test flow patterns and such. The problem is, nobody drives like the models say they should - so congestion just gets reappropriated to other parts of the city or interstate.

    It's no different with real-world data and databases. The best designed systems often fall apart under everyday use for any number of reasons. So, maybe my databases are not "real" RDBs at all - but that doesn't mean that they are any less useful or meaningful.

    Someone used a "world is flat/round" metaphor a while back. I don't think that analogy was fully correct. If nobody ever took the time to question the "authority" on the shape of the earth, we might well still believe it was flat.


  • (cs) in reply to Jeff S
    Jeff S:

    ammoQ:
    ... or the possibility to read only a part of a result set by exiting the loop, for that matter.

    What loop?  You lost me there. 



    select x from y where rownum<=10;

    is mostly the same as opening a cursor for select x from y, processing the first 10 rows, and then exiting the loop.
    In both cases, you are working on a randomly chosen subset of 10 rows, something that can hardly be expressed in terms of set operations, as relational theory requires.
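
    In PL/SQL terms, the loop version looks like this (sketch, assuming a table y with a column x):

    declare
        cursor c is select x from y;
        v y.x%type;
        n pls_integer := 0;
    begin
        open c;
        loop
            fetch c into v;
            exit when c%notfound;
            -- process v here ...
            n := n + 1;
            exit when n >= 10;  -- leave the loop early: same effect as ROWNUM <= 10
        end loop;
        close c;
    end;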
  • (cs) in reply to ammoQ
    ammoQ:
    Jeff S:

    ammoQ:
    ... or the possibility to read only a part of a result set by exiting the loop, for that matter.

    What loop?  You lost me there. 



    select x from y where rownum<=10;

    is mostly the same as opening a cursor for select x from y, processing the first 10 rows, and then exiting the loop.
    In both cases, you are working on a randomly chosen subset of 10 rows, something that can hardly be expressed in terms of set operations, as relational theory requires.

    OK, I'm lost ... you've demonstrated a worse way to write SELECT's rather than using a date or numbers table to make some sort of point, and then mentioned the advantages of "exiting a loop" by demonstrating a way to get 10 meaningless, semi-random rows from a table to make another.  I am not sure what those points are, unfortunately ...  Relational DB's are "bad"?   Cursors are "good"?  That you don't know about "SELECT TOP" ?  This proves all tables must have an "ID" ?  Tie it all together and let's here your thesis!

  • (cs) in reply to Jeff S
    Jeff S:
    OK, I'm lost ... you've demonstrated a worse way to write SELECT's rather than using a date or numbers table to make some sort of point, and then mentioned the advantages of "exiting a loop" by demonstrating a way to get 10 meaningless, semi-random rows from a table to make another.  I am not sure what those points are, unfortunately ...  Relational DB's are "bad"?   Cursors are "good"?  That you don't know about "SELECT TOP" ?  This proves all tables must have an "ID" ?  Tie it all together and let's here hear your thesis!

    oops
  • (cs) in reply to RevMike
    RevMike:
    emurphy:

    RevMike:

    If the data already contains a true natural key, then adding an artificial key breaks 3NF (or perhaps BCNF, I don't remember which).  A row should be uniquely identified by one and only one key.


    You're claiming that no table should have any alternate keys? I don't believe any of the NFs claim that. I'm using Wikipedia's article on "database normalization" as a reference here - if you have a cite to the contrary, please post a link to it.


    I don't need a link to the contrary.  Your own link has the reference I need.  Please look at the section on BCNF.

    wikipedia on Database normalization:

    Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.


    If there are both natural and artificial keys, then attributes can be expressed as dependencies on either key.  Therefore, by your own reference article, a table should not have alternate keys.


    I think I am going to have to disagree with you there.  The definition of BCNF you cite states, "At this stage, all attributes are dependent on a key, a whole key and nothing but a key" (emphasis added by me).

    Any given table can have any number of attributes or combinations of attributes which are sufficient to uniquely identify any row.  These are called candidate keys, and from them we can select a primary key.  However, all the candidate keys we did not select are still going to be in the table, and we could just as easily uniquely identify a row using one of them as we could using the primary key.

    Adding an artificial key does not break BCNF as defined above.  It merely adds another candidate key to the table.

    I understand that wikipedia might not be the best relational database reference, so I'm open to the idea that the aforementioned definition of BCNF might not be correct.  If you think it isn't, by all means, post a reference that says so; I'd be very interested in reading it.
  • (cs) in reply to ammoQ
    ammoQ:
    foxyshadis:
    I just wanted to add: It really isn't that difficult to cascade primary key changes in any modern database. (Even mysql now!) You change the key and you go on with your life without worrying about any of this crap.

    *lol* In theory, yes, you can do that. In practice, it means updating millions of records, causing deadlocks, exploding your rollback segments, freezing the app (because of the locks) for minutes.

    If you're working with other databases it simply takes a little planning and time to cascade them via a script, and these are supposed to be the values that hardly ever change, so it shouldn't be a big burden to change every few years or decades.

    In theory, theory and practice are equal. In practice, theory and practice differ. Things that can change at all have a tendency to change more often than you think, at inconvenient times.

    If your applications have their primary key baked in and recompiling/redistributing is out of the question, you're probably going to have maintenance nightmares long before the need to change natural primary keys comes in.

    It's by far less effort to change all the input/output routines that are affected by the structure of the "natural key" of a certain table than to change all the other places where the primary key is used to make joins.

    I'm not a 3rd semester CS student talking out of my ass here. Account names are probably on the threshold, but ISBNs/UPCs, product codes, or serial numbers are things that simply don't change often. I could come up with others with a bit of morning stimulant. If you have a table with fields that change constantly (defined as "often enough that cascaded updates that cause a major performance impact can't be scheduled for downtimes") you'd be insane to make that the primary key.

    I don't follow your third point, the way you phrased it is very confusing.
  • (cs) in reply to Jeff S
    Jeff S:

    OK, I'm lost ... you've demonstrated a worse way to write SELECT's rather than using a date or numbers table to make some sort of point, and then mentioned the advantages of "exiting a loop" by demonstrating a way to get 10 meaningless, semi-random rows from a table to make another.  I am not sure what those points are, unfortunately ...  Relational DB's are "bad"?   Cursors are "good"?  That you don't know about "SELECT TOP" ?  This proves all tables must have an "ID" ?  Tie it all together and let's here your thesis!



    SELECT TOP doesn't exist in Oracle,  but  "ROWNUM<=..." in the where clause does mostly the same. (Some caveats concerning ORDER BY exist)

    So, ROWNUM in Oracle can be used
    - instead of SELECT TOP
    - to generate numbers for joins (can be abused to generate dates, too)
    and some other arcane things.
    IMO this doesn't make Oracle less relational than other so-called database systems.
    From a puristic relational point of view, one might argue that cursors are bad.
    We're OT here, since all this says nothing about whether or not IDs are evil.



  • (cs) in reply to Handyman

    Anonymous:
    Anonymous:
    To me, the big "WTF" is the choice of data types.  They are often wastefully large, and the consistency of types ("varchar(50) for everything!") just smacks of amateurishness that I can't believe got in the door of a "large multi-national bank."


    Oh, but that's easy. It didn't have to enter through the door. Why? Large multi-national banks usually still run large amounts of COBOL code. Using files instead of databases. Using files with fixed-length records. It only takes a team of re-trained COBOL programmers to come up with this, nothing more. And they already have those. Trained in-house, never came in through any door!

    Varchar doesn't waste space.  I can't keep reading more posts harping on this without responding.

  • (cs) in reply to JohnO

    Let me clarify, varchar doesn't waste space for character data :).

  • Jeff (unregistered) in reply to Jeff S

    Ok, this is a very interesting thread to me because I am NOT a database guy, though I have been called upon to design databases to support my applications.  In these cases I always give it my best shot and then take it to someone I respect who has been around databases longer than I have to give constructive input.  Actually, this is generally the same process I do with any design, but with databases it is more like approaching a teacher rather than a peer.  I say all this to demonstrate I am very open and realize my knowledge is limited.

    I've almost always used artificial keys, and none of the database people I've taken my designs to have ever balked at this.  Which doesn't mean anything significant except that I am very surprised to read that some experienced developers apparently believe that not only are they bad, but that this is a fundamental issue in DB design.  So I want to know more.  I'll likely be re-designing a database here in a few months and I'd like to do a good job.  Right now the database is VERY denormalized and has no foreign key constraints at all.  I'm certain I can do better.  But I'd really like to nail it and get it right, not just better.

    So here's a very specific question from a real database I’ll be re-designing in the next six months: this application tracks bonds.  Every bond is issued (externally) an identifier known as a Cusip.  So clearly I will have a table for these bonds that contains the information.  Now obviously a Cusip makes a nice primary key on the face of it, until you realize that sometimes Cusips are reused after a bond has matured and the two won’t be confused.  So there you go - textbook case for a multi-column natural key, right?  You can set the maturity date and the Cusip as the primary key and you’re done.

    So now I store both pieces of information in every table that I want to relate to this bond, correct?  The Cusip and maturity date pair will proliferate through the data and my joins will now be on both fields.  Not what I’m used to, but I can handle that if it’s going to make my data integrity stronger.  But what do I do when the user suddenly realizes that the Cusip or maturity date was entered wrong and this error has been proliferated throughout the entire database?  My integrity is intact, but the data is very wrong and fixing it is going to take quite a bit of error-prone updates to the database.  The more relations there are to this table, the more difficult it is going to be to make sure they are all changed correctly.  I suppose I can just force the users to enter new records and do all of the updates by hand, but now I’m putting constraints on the user that are difficult to explain, especially when I’m not certain myself.  But if I only stored a reference to the data (an identity column) rather than proliferate the errors throughout the database, the data is easy and safe to correct.  So how do you (the natural key camp) handle this situation which seems textbook-made for a natural key?

    On a side note, what is this about Stored Procs being bad for integrity?  My manager has just deemed that almost everything we do should be in Stored Procs, so if there is some good information on why this is a bad thing (or questionable at least), it’d be good to know.

    I realize that a lot of people are tired of this discussion and feel like it’s “just another religious debate”, but I’m very interested in the perspectives of those with a lot more database experience than I.

  • (cs) in reply to RevMike
    RevMike:
    Jeff S:

    UncleMidriff:

    Now, what if Joe, Sue, Bubba, and Bob all live together and join your club?  You'd have four rows in your address table all with the exact same data save for the Customer foreign key.  While this would work well, and I can't foresee it causing any problems, it bugs me that we're storing the exact same address more than once.  I might propose that we move all the address info to a separate table, append some kind of AddressID to it, and then add a reference to that AddressID to our first table. That way, all the same constraints are enforced, but instead of repeating the entire address 4 times, we're only repeating the small AddressID value 4 times.  What do you think?  Have I gone off the deep end?

    That's a great modelling question; there are many ways to handle it, and it can be a good discussion.  Unfortunately, I don't see where a single word I've written disallows this; that is very different from the storage-of-addresses scenario I mentioned earlier. If each entity can have exactly one address, but some entities can share addresses, then indeed creating an address entity might be the way to go, and adding a surrogate key to that table would probably be a great idea, instead of having the key be all columns of course.  The situation I referred to was where a single entity can have multiple addresses, and ways to model that situation.

    I will repeat (yet again) that there are many situations in which surrogate keys are very useful. 

    Maybe I should just add that to my signature?  It will save some typing. 



    I prefer to model this as a separate address entry for each of Joe, Sue, Bob, and Bubba.  The fundamental problem with modeling them as a single address is that were Bob to move out, your software would have to detect that the address is shared and modify the data for Bob while leaving Joe, Sue, and Bubba at the same address.  This complicates your system needlessly.


    First to address Jeff's post:

    I didn't mean for that to sound accusatory or anything, so I apologize if it did.  I was just interested in your take on my proposed changes to your design.

    RevMike:

    I'm not sure if you were addressing Jeff's design or mine, so I'm sorry if I'm missing the point.

    In my proposed design, you'd have an address table and a table linking people to addresses.  That table would have PersonName, AddressID, and AddressType columns.  If Joe, Sue, Bob, and Bubba all live at the address with an AddressID of 123, then our table would have the following 4 rows in it:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 123, Home
    Bubba, 123, Home

    If Bob moves to another address, say the address with an AddressID of 456, then all you need to do is change Bob's row in our table so that we have:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 456, Home
    Bubba, 123, Home

    Of course, this will potentially leave you with addresses in the address table that aren't tied to anybody, and that's obviously not ideal.  Anyway, how else is this needlessly complicating the system, or am I just misunderstanding you?  If so, please let me know.
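
    In SQL, Bob's move really is just one statement (hypothetical DDL for the link table; assumes the address table already exists):

    create table person_address (
        person_name  varchar(100) not null,
        address_id   int          not null,  -- references address(address_id)
        address_type varchar(10)  not null,
        primary key (person_name, address_type)
    );

    -- Bob moves: one row changes; Joe, Sue, and Bubba stay put.
    update person_address
       set address_id = 456
     where person_name = 'Bob' and address_type = 'Home';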
  • (cs) in reply to Jeff
    Anonymous:

    Ok, this is a very interesting thread to me because I am NOT a database guy, though I have been called upon to design databases to support my applications.  In these cases I always give it my best shot and then take it to someone I respect who has been around databases longer than I have to give constructive input.  Actually, this is generally the same process I do with any design, but with databases it is more like approaching a teacher rather than a peer.  I say all this to demonstrate I am very open and realize my knowledge is limited.

    I've almost always used artificial keys, and none of the database people I've taken my designs to have ever balked at this.  Which doesn't mean anything significant except that I am very surprised to read that some experienced developers apparently believe that not only are they bad, but that this is a fundamental issue in DB design.  So I want to know more.  I'll likely be re-designing a database here in a few months and I'd like to do a good job.  Right now the database is VERY denormalized and has no foreign key constraints at all.  I'm certain I can do better.  But I'd really like to nail it and get it right, not just better.

    So here's a very specific question from a real database I’ll be re-designing in the next six months: this application tracks bonds.  Every bond is issued (externally) an identifier known as a Cusip.  So clearly I will have a table for these bonds that contains the information.  Now obviously a Cusip makes a nice primary key on the face of it, until you realize that sometimes Cusips are reused after a bond has matured and the two won’t be confused.  So there you go - textbook case for a multi-column natural key, right?  You can set the maturity date and the Cusip as the primary key and you’re done.

    So now I store both pieces of information in every table that I want to relate to this bond, correct?  The Cusip and maturity date pair will proliferate through the data and my joins will now be on both fields.  Not what I’m used to, but I can handle that if it’s going to make my data integrity stronger.  But what do I do when the user suddenly realizes that the Cusip or maturity date was entered wrong and this error has been proliferated throughout the entire database?  My integrity is intact, but the data is very wrong and fixing it is going to take quite a bit of error-prone updates to the database.  The more relations there are to this table, the more difficult it is going to be to make sure they are all changed correctly.  I suppose I can just force the users to enter new records and do all of the updates by hand, but now I’m putting constraints on the user that are difficult to explain, especially when I’m not certain myself.  But if I only stored a reference to the data (an identity column) rather than proliferate the errors throughout the database, the data is easy and safe to correct.  So how do you (the natural key camp) handle this situation which seems textbook-made for a natural key?

    On a side note, what is this about Stored Procs being bad for integrity?  My manager has just deemed that almost everything we do should be in Stored Procs, so if there is some good information on why this is a bad thing (or questionable at least), it’d be good to know.

    I realize that a lot of people are tired of this discussion and feel like it’s “just another religious debate”, but I’m very interested in the perspectives of those with a lot more database experience than I.

    Without knowing more about the specifics of what you are storing, and how often these dates are "wrong" and need to be adjusted, it is hard to be able to give a good data model.  That is the problem many people face with designing a db -- they come up with ideas or make decisions before considering carefully the whole of the system.

    Keep in mind that in most modern RDBMSs, foreign keys can cascade and automatically be updated with any changes in the primary key.  Not sure about mySQL and Oracle, but SQL Server >=2000 does this.  (In fact, even good old Access does this.)  So, if you have proper integrity set up with cascading updates, it should not be a problem.

    It comes down to this: for two bonds with the same Cusip, are they related? Is there any connection between the two, other than being assigned the same number externally?

    If yes, then you should have a table of Cusips, with all the data that is common amongst bonds with the same Cusip stored in it.  Then, you would have a child table of the Bonds themselves with a pk of Cusip/Maturity Date.

    If the answer is no, there is no connection, then I would have a table of Bonds with a unique constraint on Cusip/MaturityDate, and then probably add a BondID surrogate and let that identify each bond.  Since two bonds with the same Cusip have no relation, there is no need for the two bonds to share a common parent in the key that uniquely identifies them.

    See the difference?
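
    Roughly, in DDL (untested sketch; pick one shape or the other):

    -- If yes: bonds sharing a Cusip hang off a common parent.
    create table cusip (
        cusip char(9) primary key   -- CUSIPs are 9 characters; adjust if yours differ
        -- ... plus whatever data is common to every bond with this Cusip ...
    );

    create table bond (
        cusip         char(9)  not null references cusip(cusip),
        maturity_date datetime not null,
        primary key (cusip, maturity_date)
    );

    -- If no: surrogate key, with the natural pair kept honest as an alternate key.
    create table bond (
        bond_id       int identity primary key,
        cusip         char(9)  not null,
        maturity_date datetime not null,
        unique (cusip, maturity_date)
    );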

  • (cs) in reply to foxyshadis
    foxyshadis:

    I'm not a 3rd semester CS student talking out of my ass here. Account names are probably on the threshold, but ISBNs/UPCs, product codes, or serial numbers are things that simply don't change often. I could come up with others with a bit of morning stimulant. If you have a table with fields that change constantly (defined as "often enough that cascaded updates that cause a major performance impact can't be scheduled for downtimes") you'd be insane to make that the primary key.

    I don't follow your third point, the way you phrased it is very confusing.


    ISBNs don't change. UPC/EAN/GTIN are a different kind of beast. But that's too off-topic to go into detail. Enough to say that there is a reason why each and every business maintains their own product number instead of relying on UPC.

    The "often enough" thingy obviously depends on scheduled downtimes. Unfortunately, downtimes get shorter this times. Sooner or later the company starts selling through the internet and want its systems run 24/7. No more scheduled downtimes...

    My third point says: If you build your system with "Product_number" as the primary key of the product table and join all other tables (stock items, order items, transaction journal etc.) to the product table on the "Product_number" column, it hurts badly when - after some reorganisation, one company buys another, etc. - "Product_number" is no longer unique and you have to use "Company_number"+"Product_number" instead. This literally fscks up your whole database and probably large parts of the program. On the other hand, if you've used that unloved ID column, you only have to revise the programs that read or write the product number, so they also read and write the company number (where necessary); the other parts of the program are not affected.
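
    With the ID column, the fix stays local (sketch, hypothetical names; ALTER syntax varies by engine):

    -- The natural key widens from (product_number) to (company_number, product_number):
    alter table product add company_number int;
    alter table product drop constraint uq_product_number;
    alter table product add constraint uq_product
        unique (company_number, product_number);

    -- Meanwhile every existing join survives untouched, because it goes through the ID:
    select oi.*
      from order_item oi
      join product p on p.product_id = oi.product_id;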
  • (cs) in reply to Jeff
    Anonymous:

    On a side note, what is this about Stored Procs being bad for integrity? 

    You can safely ignore that comment; in fact, based on that statement, you might be wise to ignore any comments made by whoever wrote that.

  • (cs) in reply to zedenne

    Anonymous:
    mmm. i don't really agree with that sentiment.
    if you ever need to create a child table or some other sort of link to the mapping row you end up having to use composite foreign keys, which just make the model horrible in my mind.

    Are you kidding?  You can only have a composite foreign key if you have a composite primary key.  So you are essentially saying that any model with composite primary keys is horrible.  You are ignorant.

  • (cs) in reply to brotha_s
    brotha_s:

    You know you've been reading TDWTF for a while when you laugh out loud at "Gene Nixonko"  :)



    My sentiments exactly... I couldn't help but laugh at that myself.
  • wang (unregistered) in reply to JohnO

    JohnO:
    Let me clarify, varchar doesn't waste space for character data :).

     

    It most certainly can. Not necessarily will.

     

    Consider your typical 1-character code.  Shall we make that a char(1) or a varchar(1)?  Or indeed a varchar(50) with a constraint restricting it to a single character.  Usually you will find that adding a variable-length character field has all sorts of repercussions.  In M$ SQL Server, iirc, it takes 12 bytes per row to add one or more varchar columns to a table, and an additional 1 or 2 per field per row.  So now the choice to use a varchar(1) instead of a char(1) on a table that is quite narrow to start with has some serious implications.

     

     

  • (cs) in reply to wang
    Anonymous:

    JohnO:
    Let me clarify, varchar doesn't waste space for character data :).

     

    It most certainly can. Not necessarily will.

    Consider your typical 1-character code.  Shall we make that a char(1) or a varchar(1)?  Or indeed a varchar(50) with a constraint restricting it to a single character.  Usually you will find that adding a variable-length character field has all sorts of repercussions.  In M$ SQL Server, iirc, it takes 12 bytes per row to add one or more varchar columns to a table, and an additional 1 or 2 per field per row.  So now the choice to use a varchar(1) instead of a char(1) on a table that is quite narrow to start with has some serious implications.

    You know full well I wasn't addressing char(1) vs varchar(1). That wasn't the context of the comment.

  • Jeff (unregistered) in reply to Jeff S
    Jeff S:
    Anonymous:

    Ok, this is a very interesting thread to me because I am NOT a database guy, though I have been called upon to design databases to support my applications.  In these cases I always give it my best shot and then take it to someone I respect who has been around databases longer than I have to give constructive input.  Actually, this is generally the same process I do with any design, but with databases it is more like approaching a teacher rather than a peer.  I say all this to demonstrate I am very open and realize my knowledge is limited.

    I've almost always used artificial keys, and none of the database people I've taken my designs to have ever balked at this.  Which doesn't mean anything significant except that I am very surprised to read that some experienced developers apparently believe that not only are they bad, but that this is a fundamental issue in DB design.  So I want to know more.  I'll likely be re-designing a database here in a few months and I'd like to do a good job.  Right now the database is VERY denormalized and has no foreign key constraints at all.  I'm certain I can do better.  But I'd really like to nail it and get it right, not just better.

    So here's a very specific question from a real database I’ll be re-designing in the next six months: this application tracks bonds.  Every bond is issued (externally) an identifier known as a Cusip.  So clearly I will have a table for these bonds that contains the information.  Now obviously a Cusip makes a nice primary key on the face of it, until you realize that sometimes Cusips are reused after a bond has matured and the two won’t be confused.  So there you go - textbook case for a multi-column natural key, right?  You can set the maturity date and the Cusip as the primary key and you’re done.

    So now I store both pieces of information in every table that I want to relate to this bond, correct?  The Cusip and maturity date pair will proliferate through the data and my joins will now be on both fields.  Not what I’m used to, but I can handle that if it’s going to make my data integrity stronger.  But what do I do when the user suddenly realizes that the Cusip or maturity date was entered wrong and this error has been proliferated throughout the entire database?  My integrity is intact, but the data is very wrong and fixing it is going to take quite a bit of error-prone updates to the database.  The more relations there are to this table, the more difficult it is going to be to make sure they are all changed correctly.  I suppose I can just force the users to enter new records and do all of the updates by hand, but now I’m putting constraints on the user that are difficult to explain, especially when I’m not certain myself.  But if I only stored a reference to the data (an identity column) rather than proliferate the errors throughout the database, the data is easy and safe to correct.  So how do you (the natural key camp) handle this situation which seems textbook-made for a natural key?

    On a side note, what is this about Stored Procs being bad for integrity?  My manager has just deemed that almost everything we do should be in Stored Procs, so if there is some good information on why this is a bad thing (or questionable at least), it'd be good to know.

    I realize that a lot of people are tired of this discussion and feel like it’s “just another religious debate”, but I’m very interested in the perspectives of those with a lot more database experience than I.

    Without knowing more about the specifics of what you are storing, and how often these dates are "wrong" and need to be adjusted, it is hard to give a good data model.  That is the problem many people face when designing a db -- they come up with ideas or make decisions before carefully considering the system as a whole.

    Keep in mind that in most modern RDBMSs, foreign keys can cascade and automatically be updated with any changes in the primary key.  Not sure about mySQL and Oracle, but SQL Server >=2000 does this.  (In fact, even good old Access does this.)   So, if you have proper integrity set up with cascading updates, it should not be a problem.
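
    For example, a rough sketch in SQL Server syntax (table and column names are invented; char(9) assumes the standard 9-character CUSIP):

        CREATE TABLE Bonds (
            Cusip        char(9)  NOT NULL,
            MaturityDate datetime NOT NULL,
            CONSTRAINT PK_Bonds PRIMARY KEY (Cusip, MaturityDate)
        )

        CREATE TABLE Trades (
            TradeID      int      IDENTITY(1,1) NOT NULL PRIMARY KEY,
            Cusip        char(9)  NOT NULL,
            MaturityDate datetime NOT NULL,
            -- correcting a mistyped Cusip or date in Bonds flows down automatically
            CONSTRAINT FK_Trades_Bonds FOREIGN KEY (Cusip, MaturityDate)
                REFERENCES Bonds (Cusip, MaturityDate) ON UPDATE CASCADE
        )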

    It comes down to this: for two bonds with the same CUSIP, are they related? Is there any connection between the two, other than being assigned the same number externally?

    If yes, then you should have a table of CUSIPs, with all the data that is common among bonds sharing a CUSIP stored in it.  Then you would have a child table of the bonds themselves, with a pk of CUSIP/MaturityDate.

    If the answer is no, there is no connection, then I would have a table of Bonds with a unique constraint on CUSIP/MaturityDate, and then probably add a BondID surrogate and let that identify each bond.  Since two bonds with the same CUSIP have no relation, there is no need for them to share a common parent in the key that uniquely identifies them.

    See the difference?



    Some details- this will be SQL Server 2000, so cascading is available.  The CUSIPs and maturity dates are entered by hand, so I always have to assume they could be wrong.  Two bonds with the same CUSIP are NOT related -- it's just a simple example of id re-use over time (much like SSNs).  A child table would make no sense.
  • (cs) in reply to ammoQ
    ammoQ:
    ISBNs don't change.


    I wouldn't be so sure there. Theoretically, different editions (such as hardcover/paperback) of the same book must have different ISBNs (this may already qualify as "change", depending on your needs), while different printings of the same edition must use the same ISBN even though they may not be identical (e.g. price, errata), and ISBNs must never be reused.

    In practice, the difference between edition and printing may not be totally clear or get disregarded, and I've even heard that there have been cases where ISBNs were reused for completely different books.

  • (cs) in reply to Jeff
    Jeff - The Anonymous One:
    The CUSIPs and maturity dates are entered by hand, so I always have to assume they could be wrong.


    This might possibly be the best reason I can think of not to use natural keys (in general, of course).  At some point, a user must input the value for the natural key, and we all know how stupid users are.  No matter what we do to validate input (double-key, regex, etc.), users will still be able to enter incorrect values, and depending on the number of users, it could happen fairly frequently.

    If we use surrogate keys, we will find that correcting invalid input becomes a lot easier, because we don't have to worry about the foreign keys in other tables when we fix it.  If the user enters "HORE", and this value is used as a foreign key in child tables, a cascade update might possibly be troublesome.  But if the value he enters is simply a column with a unique constraint, with a surrogate key representing that row, he can go back and easily change it to "HORSE" (what did you think it was? sicko!) without worrying about the possible troubles involved in a cascade update.
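
    To picture that correction (table name and key value are hypothetical):

        -- child tables reference AnimalID rather than the misspelled name,
        -- so the fix touches exactly one row and no foreign keys change
        UPDATE Animals
        SET    Name = 'HORSE'
        WHERE  AnimalID = 42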

    Of course, let me qualify all my statements as a person who believes surrogate keys are preferred, not required for all tables, yadda yadda yadda....
  • (cs) in reply to Jeff

    Anonymous:
    Some details- this will be SQL Server 2000, so cascading is available.  The CUSIPs and maturity dates are entered by hand, so I always have to assume they could be wrong.  Two bonds with the same CUSIP are NOT related -- it's just a simple example of id re-use over time (much like SSNs).  A child table would make no sense.

    Well, then, there you go.  The key to this (no pun intended) is that the two CUSIPs are not related.  Therefore, you would not want a composite key, since they would have no common parent.

    Sounds like a surrogate might be a great choice here (in addition to that unique constraint on CUSIP/maturity date -- don't forget data integrity!), unless something else about the bond serves as a nice natural pk.
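
    If it helps, a rough sketch of that shape (all names invented):

        CREATE TABLE Bonds (
            BondID       int      IDENTITY(1,1) NOT NULL PRIMARY KEY,
            Cusip        char(9)  NOT NULL,
            MaturityDate datetime NOT NULL,
            -- the natural key is still enforced; it just isn't propagated to children
            CONSTRAINT UQ_Bonds_Cusip_Maturity UNIQUE (Cusip, MaturityDate)
        )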


  • (cs) in reply to UncleMidriff
    UncleMidriff:


    I'm not sure if you were addressing Jeff's design or mine, so I'm sorry if I'm missing the point.

    In my proposed design, you'd have an address table and a table linking people to addresses.  That table would have PersonName, AddressID, and AddressType columns.  If Joe, Sue, Bob, and Bubba all live at the address with an AddressID of 123, then our table would have the following 4 rows in it:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 123, Home
    Bubba, 123, Home

    If Bob moves to another address, say the address with an AddressID of 456, then all you need to do is change Bob's row in our table so that we have:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 456, Home
    Bubba, 123, Home

    Of course, this will potentially leave you with addresses in the address table that aren't tied to anybody, and that's obviously not ideal.  Anyway, how else is this needlessly complicating the system, or am I just misunderstanding you?  If so, please let me know.


    Part of the problem with this, aside from the orphan row problem, is that every address change involves a lookup of all addresses to determine whether there is a matching record already.  The logic would need to create the new address record or simply update the relationship to the existing record.  This relatively expensive search needs to be serialized, otherwise two users could simultaneously enter duplicate addresses.  Now your overhead is way up and throughput is limited.
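
    Roughly, in SQL Server terms -- assuming this runs in a stored procedure whose parameters are @Street, @City, @State, @Zip, and @PersonName -- the serialized version looks something like:

        DECLARE @AddressID int

        BEGIN TRANSACTION
        -- the lock hints serialize the search, so a concurrent session can't
        -- slip an identical address in between the SELECT and the INSERT
        SELECT @AddressID = AddressID
        FROM   Addresses WITH (UPDLOCK, HOLDLOCK)
        WHERE  Street = @Street AND City = @City AND State = @State AND Zip = @Zip

        IF @AddressID IS NULL
        BEGIN
            INSERT INTO Addresses (Street, City, State, Zip)
            VALUES (@Street, @City, @State, @Zip)
            SET @AddressID = SCOPE_IDENTITY()
        END

        UPDATE PersonAddress SET AddressID = @AddressID WHERE PersonName = @PersonName
        COMMIT TRANSACTION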

    I was actually thinking about it in the opposite way.  Rather than create a new address we could update the existing address record.

    When I model, I don't like to reduce coincidental relationships.  There is nothing fundamentally wrong with doing so, I just don't like to.  From your example I inferred that Joe, Sue, Bob, and Bubba are four different people with four distinct relationships that we are trying to track.  I don't see a good purpose in reducing them to a single detail.  If the relationship was more than coincidental, however, I have no problem with them sharing an address through an intermediate table describing the relationship, e.g. a Household table.

    As I think about it more, I'd really like to put the address type on the address table itself.  I don't like having the Many-To-Many table there.  I think this is more intuitive....

    Person
    PersonID, PersonName, BirthDate
    1, Joe, 1/1/1970
    2, Sue, 1/2/1971
    3, Bob, 1/3/1972
    4, Bubba, 1/4/1973

    PersonAddress
    PersonID, AddressType, Street, City, State, Zip
    1, Home, 123 Main St, Anywhere, NY, 11223
    2, Home, 123 Main St, Anywhere, NY, 11223
    3, Home, 123 Main St, Anywhere, NY, 11223
    4, Home, 123 Main St, Anywhere, NY, 11223

    Now, to update Bob's home address, I just update to this...

    PersonAddress
    PersonID, AddressType, Street, City, State, Zip
    1, Home, 123 Main St, Anywhere, NY, 11223
    2, Home, 123 Main St, Anywhere, NY, 11223
    3, Home, 999 Broadway, Anywhere, CA, 99999
    4, Home, 123 Main St, Anywhere, NY, 11223

    Note that this update doesn't require any large searches and happens cleanly inside a very small transaction.  If my platform supports row-level locking, updates to other records can happen simultaneously.

    Depending on the application and platform I may or may not create a domain table for address type, potentially using a constraint or a user-defined type to limit the values.  I'm not religious about this for an extremely small set of values.  In this case I'm thinking the AddressTypes might be limited to Home, Office, and Other.  If I do make AddressType a foreign key relationship, I'd likely use a small character field rather than an ID.  It is admittedly a crutch, but one which eases the work of maintainers.  I don't mind compromising normal form; I mind when normal form is compromised without consideration of the impact.
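
    Both flavors, sketched against the PersonAddress table above (constraint names invented):

        -- a check constraint, for a tiny and stable set of values
        ALTER TABLE PersonAddress ADD CONSTRAINT CK_PersonAddress_Type
            CHECK (AddressType IN ('Home', 'Office', 'Other'))

        -- or a domain table keyed on the readable code itself, not an ID
        CREATE TABLE AddressTypes (AddressType varchar(10) NOT NULL PRIMARY KEY)

        ALTER TABLE PersonAddress ADD CONSTRAINT FK_PersonAddress_Type
            FOREIGN KEY (AddressType) REFERENCES AddressTypes (AddressType)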

  • (cs) in reply to JohnO
    JohnO:
    Let me clarify, varchar doesn't waste space for character data :).


    Depending on your platform and version, this choice can have a dramatic effect on performance.  Oracle has handled VARCHAR types very cleanly since 7.0.  Sybase only got good varchar handling in version 12.  (I'm not sure where SqlServer sits today.)

    Oracle early on allowed variable length records within a database block.  Varchars are stored within the row.  Older Sybase, however, didn't store the varchar in the row, but only a pointer to the Varchar.  The data itself was in another block.  So reading an Oracle row took only one read operation, while Sybase would potentially perform that read, then go back to disk and do a second read.
  • (cs) in reply to Ytram
    Ytram:
    Jeff - The Anonymous One:
    The CUSIPs and maturity dates are entered by hand, so I always have to assume they could be wrong.


    This might possibly be the best reason I can think of not to use natural keys (in general, of course).  At some point, a user must input the value for the natural key, and we all know how stupid users are.  No matter what we do to validate input (double-key, regex, etc.), users will still be able to enter incorrect values, and depending on the number of users, it could happen fairly frequently.

    If we use surrogate keys, we will find that correcting invalid input becomes a lot easier, because we don't have to worry about the foreign keys in other tables when we fix it.  If the user enters "HORE", and this value is used as a foreign key in child tables, a cascade update might possibly be troublesome.  But if the value he enters is simply a column with a unique constraint, with a surrogate key representing that row, he can go back and easily change it to "HORSE" (what did you think it was? sicko!) without worrying about the possible troubles involved in a cascade update.

    Of course, let me qualify all my statements as a person who believes surrogate keys are preferred, not required for all tables, yadda yadda yadda....


    I'm not sure I see why this would be a huge problem.  If a record labeled "HORE" was entered incorrectly, the user would simply create a new record labeled "HORSE" and go on without a problem.  This maintains history.  (I have a bias, having spent 5 years maintaining an accounting system that never had a delete.) :)
  • (cs) in reply to RevMike

    RevMike:
    UncleMidriff:


    I'm not sure if you were addressing Jeff's design or mine, so I'm sorry if I'm missing the point.

    In my proposed design, you'd have an address table and a table linking people to addresses.  That table would have PersonName, AddressID, and AddressType columns.  If Joe, Sue, Bob, and Bubba all live at the address with an AddressID of 123, then our table would have the following 4 rows in it:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 123, Home
    Bubba, 123, Home

    If Bob moves to another address, say the address with an AddressID of 456, then all you need to do is change Bob's row in our table so that we have:

    Joe, 123, Home
    Sue, 123, Home
    Bob, 456, Home
    Bubba, 123, Home

    Of course, this will potentially leave you with addresses in the address table that aren't tied to anybody, and that's obviously not ideal.  Anyway, how else is this needlessly complicating the system, or am I just misunderstanding you?  If so, please let me know.


    Part of the problem with this, aside from the orphan row problem, is that every address change involves a lookup of all addresses to determine whether there is a matching record already.  The logic would need to create the new address record or simply update the relationship to the existing record.  This relatively expensive search needs to be serialized, otherwise two users could simultaneously enter duplicate addresses.  Now your overhead is way up and throughput is limited.

    I was actually thinking about it in the opposite way.  Rather than create a new address we could update the existing address record.

    When I model, I don't like to reduce coincidental relationships.  There is nothing fundamentally wrong with doing so, I just don't like to.  From your example I inferred that Joe, Sue, Bob, and Bubba are four different people with four distinct relationships that we are trying to track.  I don't see a good purpose in reducing them to a single detail.  If the relationship was more than coincidental, however, I have no problem with them sharing an address through an intermediate table describing the relationship, e.g. a Household table.

    As I think about it more, I'd really like to put the address type on the address table itself.  I don't like having the Many-To-Many table there.  I think this is more intuitive....

    Person
    PersonID, PersonName, BirthDate
    1, Joe, 1/1/1970
    2, Sue, 1/2/1971
    3, Bob, 1/3/1972
    4, Bubba, 1/4/1973

    PersonAddress
    PersonID, AddressType, Street, City, State, Zip
    1, Home, 123 Main St, Anywhere, NY, 11223
    2, Home, 123 Main St, Anywhere, NY, 11223
    3, Home, 123 Main St, Anywhere, NY, 11223
    4, Home, 123 Main St, Anywhere, NY, 11223

    Now, to update Bob's home address, I just update to this...

    PersonAddress
    PersonID, AddressType, Street, City, State, Zip
    1, Home, 123 Main St, Anywhere, NY, 11223
    2, Home, 123 Main St, Anywhere, NY, 11223
    3, Home, 999 Broadway, Anywhere, CA, 99999
    4, Home, 123 Main St, Anywhere, NY, 11223

    Note that this update doesn't require any large searches and happens cleanly inside a very small transaction.  If my platform supports row-level locking, updates to other records can happen simultaneously.

    Depending on the application and platform I may or may not create a domain table for address type, potentially using a constraint or a user-defined type to limit the values.  I'm not religious about this for an extremely small set of values.  In this case I'm thinking the AddressTypes might be limited to Home, Office, and Other.  If I do make AddressType a foreign key relationship, I'd likely use a small character field rather than an ID.  It is admittedly a crutch, but one which eases the work of maintainers.  I don't mind compromising normal form; I mind when normal form is compromised without consideration of the impact.

    The question to ask is: "Are addresses attributes or entities?"

    If an address is just an ATTRIBUTE of an entity -- just info you want to store for a customer -- then you either a) put it on the customer table or b) create a child addresses table of customers so you can store many addresses (1 per type, of course; pk = Customer/AddressType).  That situation is similar to what RevMike is describing.

    If an address is an ENTITY, something that you want to keep track of, something that you want other entities in your database to potentially share, then you have a separate Addresses table (pk of AddressID if you want; be sure to add unique constraints) and you have a situation similar to what Uncle M has described, where you relate entities to addresses.  Either each customer has an AddressID column in the table (for 1 address per customer), or you store multiple addresses per customer in a table with a pk of (Customer/AddressID/Type).

    See the difference?  It depends on your needs; both situations are correct.
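
    In rough DDL, the two shapes might look like this (all names invented; assume a Customers table keyed on CustomerID):

        -- address as ATTRIBUTE: a child table keyed by the owning entity
        CREATE TABLE CustomerAddresses (
            CustomerID  int         NOT NULL REFERENCES Customers (CustomerID),
            AddressType varchar(10) NOT NULL,
            Street      varchar(50) NOT NULL,
            City        varchar(30) NOT NULL,
            State       char(2)     NOT NULL,
            Zip         varchar(10) NOT NULL,
            PRIMARY KEY (CustomerID, AddressType)
        )

        -- address as ENTITY: its own table, shared through a link table
        CREATE TABLE Addresses (
            AddressID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
            Street    varchar(50) NOT NULL,
            City      varchar(30) NOT NULL,
            State     char(2)     NOT NULL,
            Zip       varchar(10) NOT NULL,
            CONSTRAINT UQ_Addresses UNIQUE (Street, City, State, Zip)
        )

        CREATE TABLE CustomerAddressLinks (
            CustomerID  int         NOT NULL REFERENCES Customers (CustomerID),
            AddressID   int         NOT NULL REFERENCES Addresses (AddressID),
            AddressType varchar(10) NOT NULL,
            PRIMARY KEY (CustomerID, AddressID, AddressType)
        )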

    Consider a phone number: the area code is shared among phone numbers.  Should you create an AreaCodes table?  You apply the same test: if AreaCode is an entity you need to track (i.e., you need to store the state, region, or long-distance info for it), then you have a table of AreaCodes (hopefully with a pk of AreaCode, not an AreaCodeID identity!) and you relate to it.  But if you are only storing the area code as an attribute, then you just have a column called AreaCode in your tables, and you fill it in and don't worry about 30 phone numbers having "duplicate" information.
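
    For the entity case, minimally (columns invented):

        CREATE TABLE AreaCodes (
            AreaCode       char(3) NOT NULL PRIMARY KEY,  -- the natural key; no identity needed
            State          char(2) NOT NULL,
            IsLongDistance bit     NOT NULL
        )

        -- phone-number tables then reference AreaCodes (AreaCode) directly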


  • (cs) in reply to Jeff S
    Jeff S:
    The question to ask is: "Are addresses attributes or entities?"

    If an address is just an ATTRIBUTE of an entity -- just info you want to store for a customer -- then you either a) put it on the customer table or b) create a child addresses table of customers so you can store many addresses (1 per type, of course; pk = Customer/AddressType).  That situation is similar to what RevMike is describing.

    If an address is an ENTITY, something that you want to keep track of, something that you want other entities in your database to potentially share, then you have a separate Addresses table (pk of AddressID if you want; be sure to add unique constraints) and you have a situation similar to what Uncle M has described, where you relate entities to addresses.  Either each customer has an AddressID column in the table (for 1 address per customer), or you store multiple addresses per customer in a table with a pk of (Customer/AddressID/Type).

    See the difference?  It depends on your needs; both situations are correct.



    You make a great point, Jeff S.

    I think the greater number of systems out there treat address as an attribute of the customer or employee entity.  However, it isn't hard to think of systems that would use address as an entity in and of itself.  Immediately I can think of a "ServiceLocation" table for a utility company or a Property table for a property tax authority.
  • (cs) in reply to RevMike
    RevMike:
    Jeff S:
    The question to ask is: "Are addresses attributes or entities?"

    If an address is just an ATTRIBUTE of an entity -- just info you want to store for a customer -- then you either a) put it on the customer table or b) create a child addresses table of customers so you can store many addresses (1 per type, of course; pk = Customer/AddressType).  That situation is similar to what RevMike is describing.

    If an address is an ENTITY, something that you want to keep track of, something that you want other entities in your database to potentially share, then you have a separate Addresses table (pk of AddressID if you want; be sure to add unique constraints) and you have a situation similar to what Uncle M has described, where you relate entities to addresses.  Either each customer has an AddressID column in the table (for 1 address per customer), or you store multiple addresses per customer in a table with a pk of (Customer/AddressID/Type).

    See the difference?  It depends on your needs; both situations are correct.



    You make a great point, Jeff S.

    I think the greater number of systems out there treat address as an attribute of the customer or employee entity.  However, it isn't hard to think of systems that would use address as an entity in and of itself.  Immediately I can think of a "ServiceLocation" table for a utility company or a Property table for a property tax authority.

    Yes!  In fact, I have found that addresses themselves are almost never entities, but rather attributes of locations, like you mention.  Especially since a location can have multiple addresses (shipping, mailing, physical, etc.).   That's a great point and something I should have mentioned: if you think an address is an entity, then you are most likely wrong, and you really want to track locations with (potentially multiple) address attributes.
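
    Sketched out (names invented), that looks like:

        CREATE TABLE Locations (
            LocationID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
            LocationName varchar(50) NOT NULL
        )

        CREATE TABLE LocationAddresses (
            LocationID  int         NOT NULL REFERENCES Locations (LocationID),
            AddressType varchar(10) NOT NULL,  -- shipping, mailing, physical, ...
            Street      varchar(50) NOT NULL,
            City        varchar(30) NOT NULL,
            State       char(2)     NOT NULL,
            Zip         varchar(10) NOT NULL,
            PRIMARY KEY (LocationID, AddressType)
        )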


  • Jeff (unregistered) in reply to Jeff S
    Jeff S:

    Well, then, there you go.  The key to this (no pun intended) is that the two CUSIPs are not related.  Therefore, you would not want a composite key, since they would have no common parent.



    Let me try to rephrase this as a general principle, and please let me know if I get it right.  If a row in a database can be identified by a single column, then that column should be used as your primary key.  If multiple columns are needed, then you might make a composite key, depending on whether rows with a matching value for part of the key would share some data.  If they would not, then an artificial key makes more sense (while still putting the constraint on your data to keep from duplicating rows you shouldn't).

    If I understand this correctly then wouldn't at least part of each of your composite keys always exist in a parent table as a primary key?

    Also, is there a good book on DB design that I should make part of my reading over the next few months?  Something long on practice but with enough theory that I'd know the "whys" of good db design?  I don't intend to ever be a db programmer, but I like to know enough about whatever I touch that I don't create more problems than I solve.
