The Daily WTF: Curious Perversions in Information Technology
Welcome to TDWTF Forums Sign in | Join | Help
in Search

Primary Key/Clustered Index

Last post 07-19-2005 4:57 AM by tufty. 48 replies.
Page 1 of 1 (49 items)
Sort Posts: Previous Next
  • 07-06-2005 9:14 AM

    Primary Key/Clustered Index

    Okay,

    I generally don't post much worthwhile stuff, but here's one. Something I find a bit irksome about SQL Server, In the table designer (yes, I use it, so what!) if you select a field to be your primary key, it creates a clustered index. My understanding of clustered indexes was that they should be used to reduce the number of records that need to be scanned, for example (to make it relavent to message boards), you have a table of posts. You might have a clustered index on the posts table using TopicID as its key, rather than the default PostID. What this means is that the rows will no longer be physically arranged by their primary key, but rather by their topic (i.e. when you list the posts in a topic, you narrow the number of rows you have to look at, using another index on a column like PostTime to sort the (much shorter) list)

    My question is: What is the reason for SQL Server using a clustered index on a primary key, and are there situations where this is a strong advantage? (Yes, it should be indexed, thats not in question...)
  • 07-06-2005 9:47 AM In reply to

    Re: Primary Key/Clustered Index

    First (and you seemed to expect this), you should get in the habbit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything done.

    Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.

    That said, a clustered index will physically order the rows as they are inserted/updated based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PK's are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PK's don't change very often (or shouldn't, at least).

    As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.

  • 07-06-2005 10:30 AM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:

    First (and you seemed to expect this), you should get in the habbit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything done.


    It's a hard habit to break, when everyone around you is in the same habit, but it tends to really come back to haunt you when you need to build scripts for changes to tables... At prior employers they had a rule against using anything but DDL scripts, becuase the script that you used to change the schema had to be added to source control, and also needed to be combined later with other scripts when an update was sent to the data center.

    Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.

    I don't know why, but I've always had difficulty with this terminology (I tend to get it mixed up frequently, More along the lines of "I know what I mean", hence the inconsistent use in my own post.)

    That said, a clustered index will physically order the rows as they are inserted/updated based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PK's are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PK's don't change very often (or shouldn't, at least).

    As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.


    Yep, Makes sense.

  • 07-06-2005 3:50 PM In reply to

    Re: Primary Key/Clustered Index

    Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.


    I've never been formally schooled in databasing, nor have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define Field as the intersection of a row and column in a database, and a Record as a Row With Data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.

    The derivatives vs integrals comparisin means nothing to me. 8-|
    — Flurp.
  • 07-06-2005 4:55 PM In reply to

    Re: Primary Key/Clustered Index

     dhromed wrote:

    I've never been formally schooled in databasing, nor have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define Field as the intersection of a row and column in a database, and a Record as a Row With Data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.

    The derivatives vs integrals comparisin means nothing to me. 8-|

    A lot of (non-database) people say "rows/records, fields/columns, tomato/tomatah". But there really are fundamental differences between the two. They're not synonymous at all. A non-calc example would be comparing a "type" (or "class") with an "object". Similar, but fundamentally different.

    Correction for your terminology:

    • The intersection of a Row and a Column is a Value.
    • A Field, on the other hand, is an application-specific part of a Record.
    • A Row cannot, by definition, exist without data. Tables need not have rows, but a row must contain data.

    And as an FYI, here's some more differences on Records vs Rows, Fields vs Columns. I wrote this in some other post here  ....

  • Records are definied within the application (think COBOL or C-structs reading from binary files). Rows are definied in the database.
  • Records do not have consistant structure. In file systems, flags within a file record determine what data follow All Rows within a table are the same.
  • Records are ordered. Rows have no order.
  • An empty file (where you store records) contains nothing: zero bytes on disk. An empty table still has columns, permissions,  constraints, etc.
  • Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)
  • Fields are ordered, columns are not
  • Fields cannot be "null" (what combination of bytes would you use to represent an integer null?); they can only contain a "dummy value" known by the application (999999999).
  • Fields can contain other records. Columns are always scalar (many programmers also make the mistake of the "CSV" column)
  • 07-06-2005 5:19 PM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:


  • Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)


  • Ah, my head.. :)

    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule?  Or, what about a self referencing constraint?  You have an Offices table, for instance, with a "Central Office" column.  The CO column requires a value that exists within the Office table.

    I guess I'm asking what you consider a "pointer" in terms of the database.

    Thanks.

  • 07-06-2005 6:16 PM In reply to

    Re: Primary Key/Clustered Index

     bmschkerke wrote:
     Alex Papadimoulis wrote:


  • Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)


  • Ah, my head.. :)

    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule?  Or, what about a self referencing constraint?  You have an Offices table, for instance, with a "Central Office" column.  The CO column requires a value that exists within the Office table.

    I guess I'm asking what you consider a "pointer" in terms of the database.

    Thanks.



    I was going to ask the same thing.

    Also, I appreciate your discussion about the differences between "field" and "column," "record" and "row," etc.  Because I wanted to know what the hell you were going on about, I read up on the subject and feel that I have really deepened my knowledge of the relational model.  I saw things that brought back horrible, terrible memories from some of my more theoretical math classes in college.  So, again, thanks for spurring me on to learning!

    That said, I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row."  With absolutely no intention of offending you, it seems a little odd.  Like if you were sitting next to someone, reviewing some poorly written code, and he said, "what do you supposed he means to be doing with this class here" as he very clearly pointed to an object.  Chances are, he's not a total idiot and he realizes the difference between an object and a class, so there's no need to scream, "That's an object, not a class!  They really are very different things!" every time he slips.

    Anyway, like I said, I sincerely am not trying to be rude or condescending.  It's just that I find your behaviour in this instance puzzling, that's all.  Also, please realize that I realize that it's proabably my own lack of experience that's causing your behavior to seem puzzling to me.  I'm willing to bet I'll understand when I've got more experience.


  • 07-06-2005 6:23 PM In reply to

    Re: Primary Key/Clustered Index

     bmschkerke wrote:
    I guess I'm asking what you consider a "pointer" in terms of the database.

    A pointer is something that does not exist in the real-world problem domain and are not exposed to end-users of the system. 

     bmschkerke wrote:
    In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? 

    Possibly. Why do you need an IDENTITY column? Addresses are standardized and will usually a DUN numbers. If not, it's really a problem for the business to solve. Generally, they solve this with (Cust_No + Address_Seq).

  • 07-06-2005 8:00 PM In reply to

    Re: Primary Key/Clustered Index

     UncleMidriff wrote:
    I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behaviour in this instance puzzling, that's all. 

    Programming is one of the few professions where any one can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a licence to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death due; I've personally seen the former and have heard quite a few anecdotes about the latter.

    Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagarly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

    And usually, I've find that when people don't know the basics, they are still doing it anyway as a professional. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone "yelled" at me earlier!

  • 07-06-2005 9:22 PM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:

     UncleMidriff wrote:
    I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behaviour in this instance puzzling, that's all. 

    Programming is one of the few professions where any one can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a licence to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death due; I've personally seen the former and have heard quite a few anecdotes about the latter.

    Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagarly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

    And usually, I've find that when people don't know the basics, they are still doing it anyway as a professional. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone "yelled" at me earlier!



    I see..that makes sense.  Thanks for the explanation.

    About pointers:  If you have the time and the inclination, could you give an example of a few tables with rows that contain pointers and what you would do to make them not suck?  Even though you posted about it twice, I'm still having trouble understanding exactly what it is that you mean.  It almost sounds to me like you're saying that meaningless primary keys are a bad thing, but I figure if that's what you meant you would have come right out and said it already.  If that is what you mean, I don't know what to think, because I've heard convincing arguments for both sides of that argument.
  • 07-06-2005 11:45 PM In reply to

    Re: Primary Key/Clustered Index

     UncleMidriff wrote:
    It almost sounds to me like you're saying that meaningless primary keys are a bad thing ... I've heard convincing arguments for both sides of that argument.

    Yes, that's pretty much another way of putting it. Non-domain keys are by definition meaningless and functionaly act as row pointers. Pointers (according to the information rule of RDMS) are non-relational.

    When I first started hacking around in a database, I had no idea what I was doing and ended up adding a sequence on every table (this was back in the day when DBs did not easily support AUTOID columns). It just seemed the logical way to do things.

    After studying relational theory and it's predecessors (CODASYL, IMS, etc), it became pretty clear how the relational model is a much better way of doing things. If you're interested in the topic, Elsmari and Nathe have a good textbook that goes in much detail about non-relational models.

    You'll find that those who advocate pointer-based data modeling (IDENTITY, GUID, etc) are not seasoned data modelers nor do they have an understanding of the history of data modeling. If they were, they would have run into the same problem with their methods that the creators of the RM had back in the late 60's.

  • 07-07-2005 8:25 AM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:
    To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.


    I'm sure a lot of the confusion spurs form the programming side. Take, for example, ADO, which allows interaction between the application and a RDBMS. ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

    I'm defending myself in this instance, I understand rows and columns, and actually understand some of the finer differences of rows vs records and columns vs fields, now thanks to you, but with APIs still refering to things using the wrong terminology, it's very easy to get tripped up.

    Your opinion on IDENTITY fields is very interesting, and I have often thought about just that, because it seems to be pretty much a habit of the IT industry in general: Have a row of information? tack an ID on it so we can get to it later. Its almost a mantra, though in some cases, some kind of ID number would be useful: Take the example of a table to customers. Using first and last name as the identifying piece of information will not work, becuase there's always more than one John Smith. Another piece of information is necessary, such as a phone number. In this case, it could be simpler to select by an arbitrary assigned value to look up a particular customer record. Also, in tables, such as something like order history, the foriegn key would require 3 pieces of information. It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong?

  • 07-07-2005 9:50 AM In reply to

    Re: Primary Key/Clustered Index

     Mike R wrote:
    I'm sure a lot of the confusion spurs form the programming side.

    Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for them to evolve out of earlier models.

     Mike R wrote:

    ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

    I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provoider from any source (from AD to to MAPI) and in many of those sources, tables are not the mechanism by which data is stored.

    Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.

     Mike R wrote:

    IDENTITY ... seems to be pretty much a habit of the IT industry in general

    Some one else mentioned that to me a while a go, and I think it's an exposure thing. On big systems developed by experienced, data specialists, you'll never see an autonumber or equivilent as the system. These same experienced data folks don't write in general publications; they have their own set of magazines that us non-data specialists would never read. The people who do write in the general publications don't fall in the category of specialist; they seem to be more generalist. And you know the jack of all / master of none rule ...

     Mike R wrote:
    It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong?

    Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

    The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.

    When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corode. And when it comes time for HIPPA/SOX/etc audits, whooo-boy: you don't even want to know how many hundreds-of-dollars an hour and how many hours it takes for data consultants charge to fix your system and bring it to complaince.

    One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.

    Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.

  • 07-07-2005 10:56 AM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:

     Mike R wrote:
    I'm sure a lot of the confusion spurs form the programming side.

    Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for them to evolve out of earlier models.

     Mike R wrote:

    ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

    I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provoider from any source (from AD to to MAPI) and in many of those sources, tables are not the mechanism by which data is stored.

    Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.

     Mike R wrote:

    IDENTITY ... seems to be pretty much a habit of the IT industry in general

    Some one else mentioned that to me a while a go, and I think it's an exposure thing. On big systems developed by experienced, data specialists, you'll never see an autonumber or equivilent as the system. These same experienced data folks don't write in general publications; they have their own set of magazines that us non-data specialists would never read. The people who do write in the general publications don't fall in the category of specialist; they seem to be more generalist. And you know the jack of all / master of none rule ...

     Mike R wrote:
    It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong?

    Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

    The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.

    When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corode. And when it comes time for HIPPA/SOX/etc audits, whooo-boy: you don't even want to know how many hundreds-of-dollars an hour and how many hours it takes for data consultants charge to fix your system and bring it to complaince.

    One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.

    Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.



    But how does using an indentity column or other form of meaningless primary key on a table lead to data corruption?  I'm just not seeing it.  The only thing in my admittedly limited experience that I have seen cause problems is when someone decides not to use a meaningless primary key.  That has inevitably led to severe problems because whatever meaningful bit of data they chose for the primary key needed to be changed in some way later on in the life of the project.  Granted, just because a few yahoos before me couldn't pick a proper, meaningful primary key doesn't mean that meaningful primary keys are bad, and I'm not trying to imply that.  But I just can't see how meaningless primary keys, by their very nature, lead to data corruption.
  • 07-07-2005 11:57 AM In reply to

    Re: Primary Key/Clustered Index

     UncleMidriff wrote:

    I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

    They realized long ago that, no matter how good the programmers are, you still will end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/complaince data), you have serious problems.

    One of the biggest innovations with the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS was responsbile for enforcing business rules on whether or not data is valid: everything from birthdays occuring at least eighteen years ago to orders having existing products.

    When you use pointers to relate data instead of the data iteself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplist example comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have DI at the database:

    Everytime a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).

    Procedures had user-maintinable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

    If the user wanted to rename a KOV, the system would check if any procedures used the KOV and, if so, create a new KOV and inactive the old one, thereby ensuring all old KOV remain in the system. However, there was a bug in this maintence module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this but none of them caught the peculiar way of changing the KOV.

    After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way the could correct the problem (and pass ISO) was to take three years worth of tape backups, paper copies, etc, hire an army of temps and try to reconcile the thousands upon thousands of procedures.

    Had they not used an ID pointer, the trigger would rejected the update before it snowballed into half a million. This is pretty much how it usuaully happens: a minute bug in the Middle tier slowly corrupting historical data.

    All in all, if you use these non-relational techniques, you end up trying to have to put in gobs of work-around code (the bandaid for this problem was to put in a trigger that queried the meta-tables and used dynamic sql to check what tables used that the value) which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

    Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.

  • 07-07-2005 2:19 PM In reply to

    Re: Primary Key/Clustered Index

    I don't need an identity column, but didn't have a suitable primary key replacement.  Duplicate addresses are to be accepted, if entered.  We have an entity that has up to five addresses.  Rather create columns of StreetAddress1, StreetAddress2, City, State, ZipCode five times I moved it into a separate table.  Since each entity can have multiple addresses I cannot use the entity table's primary key as an address primary key.  We deal primarily with individuals and so don't track DUN numbers. 

    The customer number and sequence would work, possibly.  However, I have linked this table to other tables, not all of which share a similar logical structure.  (Not all addresses are related to customers, for example.) 

    Should each table have a separate Addresses table, so that it can be managed using a parent table's keys?  Or denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?
  • 07-07-2005 2:55 PM In reply to

    Re: Primary Key/Clustered Index

     Alex Papadimoulis wrote:

     UncleMidriff wrote:

    I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

    They realized long ago that, no matter how good the programmers are, you still will end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/complaince data), you have serious problems.

    One of the biggest innovations with the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS was responsbile for enforcing business rules on whether or not data is valid: everything from birthdays occuring at least eighteen years ago to orders having existing products.

    When you use pointers to relate data instead of the data iteself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplist example comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have DI at the database:

    Everytime a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).

    Procedures had user-maintinable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

    If the user wanted to rename a KOV, the system would check if any procedures used the KOV and, if so, create a new KOV and inactive the old one, thereby ensuring all old KOV remain in the system. However, there was a bug in this maintence module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this but none of them caught the peculiar way of changing the KOV.

    After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way the could correct the problem (and pass ISO) was to take three years worth of tape backups, paper copies, etc, hire an army of temps and try to reconcile the thousands upon thousands of procedures.

    Had they not used an ID pointer, the trigger would rejected the update before it snowballed into half a million. This is pretty much how it usuaully happens: a minute bug in the Middle tier slowly corrupting historical data.

    All in all, if you use these non-relational techniques, you end up trying to have to put in gobs of work-around code (the bandaid for this problem was to put in a trigger that queried the meta-tables and used dynamic sql to check what tables used that the value) which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

    Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.



    So, in this system, the values in the Variable column of the KOV table are supposed to be unique, and thus it could have been used as a primary key.  Also, the users can change the value in the Variable column of a given row whenever the heck they please.  Provided that these assumptions are correct, here's what I would have done:

    Since the value of Variable for a given KOV is expected to change, and since I wouldn't want to change every row in every table that referenced a particualr KOV everytime someone changed the value of Variable for that KOV, I would have used an Id column as the primary key.  Then I would have added a trigger that would whine if someone tried to add or change a KOV with the same Variable value as an existing KOV.  To keep a history of changes, I would have added another table, named something like "KOVChangeLog," in which I would store the Id and old Variable values of the changed KOV, as well as the user who changed it and the time at which it was changed.

    When someone changes a KOV, the system would check to see if any procedures reference that KOV.  If so, the KOVChangeLog table would have added to it the Id and old Variable value of that KOV along with the user responsible for and the time of the change, and then that KOV would be updated.  If not, that KOV would just be updated.  The trigger would ensure that the user wouldn't be able to change that KOV to a KOV that already exists, and the meaningless Id column used as the primary key would ensure that every row in every table referencing that KOV wouldn't have to be updated everytime that KOV might be changed.

    I apologize if the above is a huge WTF for any reason, and I offer the excuse that I haven't been at this very long.

    Another question:  how is the use of pointers "non-relational" according to the Information Rule?  I'm assuming here that you are talking about Codd's 12 Rules.

     Wikipedia.org wrote:

    The information rule: All information in the database to be represented in one and only one way, namely by values in column positions within rows of tables.


    Perhaps I'm just dense, but I don't see how using meaningless primary keys violates this rule.
  • 07-07-2005 3:59 PM In reply to

    Re: Primary Key/Clustered Index

     bmschkerke wrote:

    The customer number and sequence would work, possibly.  However, I have linked this table to other tables, not all of which share a similar logical structure.  (Not all addresses are related to customers, for example.) 

    Should each table have a separate Addresses table, so that it can be managed using a parent table's keys?  Or denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?

    You're on the right track by putting it in a (single) addresses table. Two things to think about are orphaned addresses (an address without a Customer, Vendor, etc) and shared addresses (an address shared between a Customer and a Vendor). Both may or may not be problematic depending on your problem domain.

  • 07-07-2005 4:40 PM In reply to