|
Primary Key/Clustered Index
Last post 07-19-2005 4:57 AM by tufty. 48 replies.
-
07-06-2005 9:14 AM
|
|
-
Mike R


- Joined on 11-22-2004
- Posts 440
|
Primary Key/Clustered Index
Okay,
I generally don't post much worthwhile stuff, but here's one. Something
I find a bit irksome about SQL Server, In the table designer (yes, I
use it, so what!) if you select a field to be your primary key, it
creates a clustered index. My understanding of clustered indexes was
that they should be used to reduce the number of records that need to
be scanned, for example (to make it relavent to message boards), you
have a table of posts. You might have a clustered index on the posts
table using TopicID as its key, rather than the default PostID. What
this means is that the rows will no longer be physically arranged by
their primary key, but rather by their topic (i.e. when you list the
posts in a topic, you narrow the number of rows you have to look at,
using another index on a column like PostTime to sort the (much
shorter) list)
My question is: What is the reason for SQL Server using a clustered
index on a primary key, and are there situations where this is a strong
advantage? (Yes, it should be indexed, thats not in question...)
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
First (and you seemed to expect this), you should get in the habbit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything done.
Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.
That said, a clustered index will physically order the rows as they are inserted/updated based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PK's are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PK's don't change very often (or shouldn't, at least).
As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.
|
|
-
-
Mike R


- Joined on 11-22-2004
- Posts 440
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: | First (and you seemed to expect
this), you should get in the habbit of writing DDL instead of using
Enterprise Manager. As the system grows, it becomes migration hell to
move between environments (Dev, Test, UAT, Prod, etc) without DDL
scripts and change management. Another advantage is that you have
control and can explicitly see everything done. |
|
It's a hard habit to break, when everyone around you is in the same
habit, but it tends to really come back to haunt you when you need to
build scripts for changes to tables... At prior employers they had a
rule against using anything but DDL scripts, becuase the script that
you used to change the schema had to be added to source control, and
also needed to be combined later with other scripts when an update was
sent to the data center.
| Second, fields do not exist in a relational database. Columns
do. There is a fundamental difference between the two terms. Ditto for
rows versus records. Usage of the proper ensures a strong mental model
of the domain; think mixing up your derivatives and integrals in
calculus. |
|
I don't know why, but I've always had difficulty with this
terminology (I tend to get it mixed up frequently, More along the lines
of "I know what I mean", hence the inconsistent use in my own post.)
| That said, a clustered index will physically order the rows
as they are inserted/updated based on the column(s) indexed. This also
happens to be why only one clustered index is permitted per table. SQL
Server defaults to a clustered index on the primary key because PK's
are unique (therefore ensuring an even spread of rows across pages with
no "hotspots") and because PK's don't change very often (or shouldn't,
at least).
As you indicated, a TopicID is a better candidate for your clustered
index because you'll often be pulling an entire Topic worth of posts.
Since they're close together physically, it saves a lot of IO. But if
you don't have such a requirement, you're usually best sticking your
clustered index on the PK so you've got a good random distribution
across pages. |
|
Yep, Makes sense.
|
|
-
-
dhromed


- Joined on 04-13-2005
- Dutchland
- Posts 2,532
|
Re: Primary Key/Clustered Index
| Second, fields do not exist in a relational database. Columns
do. There is a fundamental difference between the two terms. Ditto for
rows versus records. Usage of the proper ensures a strong mental model
of the domain; think mixing up your derivatives and integrals in
calculus. |
|
I've never been formally schooled in databasing, nor have the kind of
in-depth knowledge displayed here so often, but in my experience I've
come to define Field as the intersection of a row and column in a
database, and a Record as a Row With Data. Synonyms mostly, but Field
and Record because it's a database table, not yer average table within
whatever context or environment.
The derivatives vs integrals comparisin means nothing to me. 8-|
— Flurp.
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
dhromed wrote: | |
I've never been formally schooled in databasing, nor have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define Field as the intersection of a row and column in a database, and a Record as a Row With Data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.
The derivatives vs integrals comparisin means nothing to me. 8-|
|
|
A lot of (non-database) people say "rows/records, fields/columns, tomato/tomatah". But there really are fundamental differences between the two. They're not synonymous at all. A non-calc example would be comparing a "type" (or "class") with an "object". Similar, but fundamentally different.
Correction for your terminology:
- The intersection of a Row and a Column is a Value.
- A Field, on the other hand, is an application-specific part of a Record.
- A Row cannot, by definition, exist without data. Tables need not have rows, but a row must contain data.
And as an FYI, here's some more differences on Records vs Rows, Fields vs Columns. I wrote this in some other post here ....
- Records are definied within the application (think COBOL or C-structs reading from binary files). Rows are definied in the database.
- Records do not have consistant structure. In file systems, flags within a file record determine what data follow All Rows within a table are the same.
- Records are ordered. Rows have no order.
- An empty file (where you store records) contains nothing: zero bytes on disk. An empty table still has columns, permissions, constraints, etc.
- Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)
- Fields are ordered, columns are not
- Fields cannot be "null" (what combination of bytes would you use to represent an integer null?); they can only contain a "dummy value" known by the application (999999999).
- Fields can contain other records. Columns are always scalar (many programmers also make the mistake of the "CSV" column)
|
|
-
-
bmschkerke


- Joined on 07-06-2005
- Posts 16
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: |
- Most importantly, records contain other records (or a pointer to
another record in another file). Rows cannot contain other rows and
*should not* contain pointers to other rows (many programmers make this
mistake by putting pointers in their rows known as GUID or IDENTITY)
|
|
Ah, my head.. :)
In reference to the above - if you're storing addresses in an Addresses
table, with the primary key being an autoincrementing identity column,
does using that identity column in the Orders table to locate the
related address violate the above rule? Or, what about a self
referencing constraint? You have an Offices table, for instance,
with a "Central Office" column. The CO column requires a value
that exists within the Office table.
I guess I'm asking what you consider a "pointer" in terms of the database.
Thanks.
|
|
-
-
UncleMidriff


- Joined on 12-17-2004
- Oklahoma City, Oklahoma
- Posts 211
|
Re: Primary Key/Clustered Index
bmschkerke wrote: | Alex Papadimoulis wrote: |
- Most importantly, records contain other records (or a pointer to
another record in another file). Rows cannot contain other rows and
*should not* contain pointers to other rows (many programmers make this
mistake by putting pointers in their rows known as GUID or IDENTITY)
|
|
Ah, my head.. :)
In reference to the above - if you're storing addresses in an Addresses
table, with the primary key being an autoincrementing identity column,
does using that identity column in the Orders table to locate the
related address violate the above rule? Or, what about a self
referencing constraint? You have an Offices table, for instance,
with a "Central Office" column. The CO column requires a value
that exists within the Office table.
I guess I'm asking what you consider a "pointer" in terms of the database.
Thanks.
|
|
I was going to ask the same thing.
Also, I appreciate your discussion about the differences between
"field" and "column," "record" and "row," etc. Because I wanted
to know what the hell you were going on about, I read up on the subject and feel that I have really deepened my
knowledge of the relational model. I saw things that brought back
horrible, terrible memories from some of my more theoretical math
classes in college. So, again, thanks for spurring me on to
learning!
That said, I don't understand why you make such a big deal whenever
someone slips up and says "record" when they meant "row." With
absolutely no intention of offending you, it seems a little odd.
Like if you were sitting next to someone, reviewing some poorly written
code, and he said, "what do you supposed he means to be doing with this
class here" as he very clearly pointed to an object. Chances are,
he's not a total idiot and he realizes the difference between an object
and a class, so there's no need to scream, "That's an object, not a
class! They really are very different things!" every time he
slips.
Anyway, like I said, I sincerely am not trying to be rude or
condescending. It's just that I find your behaviour in this
instance puzzling, that's all. Also, please realize that I
realize that it's proabably my own lack of experience that's causing your
behavior to seem puzzling to me. I'm willing to bet I'll
understand when I've got more experience.
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
bmschkerke wrote: | | I guess I'm asking what you consider a "pointer" in terms of the database. |
|
A pointer is something that does not exist in the real-world problem domain and are not exposed to end-users of the system.
bmschkerke wrote: | | In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? |
|
Possibly. Why do you need an IDENTITY column? Addresses are standardized and will usually a DUN numbers. If not, it's really a problem for the business to solve. Generally, they solve this with (Cust_No + Address_Seq).
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
UncleMidriff wrote: | | I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behaviour in this instance puzzling, that's all. |
|
Programming is one of the few professions where any one can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a licence to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death due; I've personally seen the former and have heard quite a few anecdotes about the latter.
Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagarly do it anyway. Worse still, they don't even realize that they don't know what they're doing.
To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.
And usually, I've find that when people don't know the basics, they are still doing it anyway as a professional. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone "yelled" at me earlier!
|
|
-
-
UncleMidriff


- Joined on 12-17-2004
- Oklahoma City, Oklahoma
- Posts 211
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: | UncleMidriff wrote: | | I don't
understand why you make such a big deal whenever someone slips up and
says "record" when they meant "row" ... I find your behaviour in this
instance puzzling, that's all. |
|
Programming is one of the few professions where any one can grab a
compiler and make a living writing code. Plumbers, barbers, and even
insurance CSRs are required to maintain a licence to do their
day-to-day job. But unlike a bad haircut or a leaky faucet, bad
programming can cause massive financial loss and even death
due; I've personally seen the former and have heard quite a few
anecdotes about the latter.
Over the years, I've come to realize that bad programming (at the
level I mentioned) is rarely done by a lazy or incompetent
programmer; it's done by programmers who have absolutely no idea
what they are doing yet eagarly do it anyway. Worse still, they don't
even realize that they don't know what they're doing.
To my point, when I see people use the wrong terminology, it implies
that they really don't understand the basics, yet are trying to do it
anyway. Mathmeticians don't mix up sines and integrals; carpenters
don't mix up biscuit joints and bevels; and database programmers
shouldn't mix up fields and columns.
And usually, I've find that when people don't know the basics, they
are still doing it anyway as a professional. Myself included. I can
only be thankful that I didn't cause too much financial loss (only an
unmaintainable system here and there). I only wish someone "yelled" at
me earlier! |
|
I see..that makes sense. Thanks for the explanation.
About pointers: If you have the time and the inclination, could
you give an example of a few tables with rows that contain pointers and
what you would do to make them not suck? Even though you posted
about it twice, I'm still having trouble understanding exactly what it
is that you mean. It almost sounds to me like you're saying that
meaningless primary keys are a bad thing, but I figure if that's what
you meant you would have come right out and said it already. If
that is what you mean, I don't know what to think, because I've heard convincing arguments for both sides of that argument.
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
UncleMidriff wrote: | | It almost sounds to me like you're saying that meaningless primary keys are a bad thing ... I've heard convincing arguments for both sides of that argument. |
|
Yes, that's pretty much another way of putting it. Non-domain keys are by definition meaningless and functionaly act as row pointers. Pointers (according to the information rule of RDMS) are non-relational.
When I first started hacking around in a database, I had no idea what I was doing and ended up adding a sequence on every table (this was back in the day when DBs did not easily support AUTOID columns). It just seemed the logical way to do things.
After studying relational theory and it's predecessors (CODASYL, IMS, etc), it became pretty clear how the relational model is a much better way of doing things. If you're interested in the topic, Elsmari and Nathe have a good textbook that goes in much detail about non-relational models.
You'll find that those who advocate pointer-based data modeling (IDENTITY, GUID, etc) are not seasoned data modelers nor do they have an understanding of the history of data modeling. If they were, they would have run into the same problem with their methods that the creators of the RM had back in the late 60's.
|
|
-
-
Mike R


- Joined on 11-22-2004
- Posts 440
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: | | To my point, when I see people use the
wrong terminology, it implies that they really don't understand the
basics, yet are trying to do it anyway. Mathmeticians don't mix up
sines and integrals; carpenters don't mix up biscuit joints and bevels;
and database programmers shouldn't mix up fields and
columns. |
|
I'm sure a lot of the confusion spurs form the programming side. Take,
for example, ADO, which allows interaction between the application and
a RDBMS. ADO refers to a set of rows as a recordset, and a set of
values as a list of fields.
I'm defending myself in this instance, I understand rows and columns,
and actually understand some of the finer differences of rows vs
records and columns vs fields, now thanks to you, but with APIs still
refering to things using the wrong terminology, it's very easy to get
tripped up.
Your opinion on IDENTITY fields is very interesting, and I have often
thought about just that, because it seems to be pretty much a habit of
the IT industry in general: Have a row of information? tack an ID on it
so we can get to it later. Its almost a mantra, though in some cases,
some kind of ID number would be useful: Take the example of a table to
customers. Using first and last name as the identifying piece of
information will not work, becuase there's always more than one John
Smith. Another piece of information is necessary, such as a phone
number. In this case, it could be simpler to select by an arbitrary
assigned value to look up a particular customer record. Also, in
tables, such as something like order history, the foriegn key would
require 3 pieces of information. It seems more efficient to store an
IDENTITY value, or some other unique value such as a GUID or hash of
the 3 pieces of identifying information (though a hash can, by its very
nature, collide...), In this case, it seems better just to go with
IDENTITY values. Am I wrong?
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
Mike R wrote: | | I'm sure a lot of the confusion spurs form the programming side. |
|
Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for them to evolve out of earlier models.
Mike R wrote: | ADO refers to a set of rows as a recordset, and a set of values as a list of fields. |
|
I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provoider from any source (from AD to to MAPI) and in many of those sources, tables are not the mechanism by which data is stored.
Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.
Mike R wrote: | IDENTITY ... seems to be pretty much a habit of the IT industry in general
|
|
Some one else mentioned that to me a while a go, and I think it's an exposure thing. On big systems developed by experienced, data specialists, you'll never see an autonumber or equivilent as the system. These same experienced data folks don't write in general publications; they have their own set of magazines that us non-data specialists would never read. The people who do write in the general publications don't fall in the category of specialist; they seem to be more generalist. And you know the jack of all / master of none rule ...
Mike R wrote: | | It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong? |
|
Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.
The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.
When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corode. And when it comes time for HIPPA/SOX/etc audits, whooo-boy: you don't even want to know how many hundreds-of-dollars an hour and how many hours it takes for data consultants charge to fix your system and bring it to complaince.
One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.
Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.
|
|
-
-
UncleMidriff


- Joined on 12-17-2004
- Oklahoma City, Oklahoma
- Posts 211
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: | Mike R wrote: | | I'm sure a lot of the confusion spurs form the programming side. |
|
Absolutely. The relational model just doesn't "make sense" from a
procedural programming perspective. This is why it took 20 years for
them to evolve out of earlier models.
Mike R wrote: | ADO refers to a set of rows as a recordset, and a set of values as a list of fields. |
|
I wondered that, too. I'm glad to see ADO.NET changed it, but
perhaps one reason behind the original naming conventions was the
generalized nature of ADO. Anyone could write a provoider from any
source (from AD to to MAPI) and in many of those sources, tables are
not the mechanism by which data is stored.
Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.
Mike R wrote: | IDENTITY ... seems to be pretty much a habit of the IT industry in general
|
|
Some one else mentioned that to me a while a go, and I think it's an
exposure thing. On big systems developed by experienced, data
specialists, you'll never see an autonumber or equivilent as the
system. These same experienced data folks don't write in general
publications; they have their own set of magazines that us non-data
specialists would never read. The people who do write in the general
publications don't fall in the category of specialist; they seem to be
more generalist. And you know the jack of all / master of none rule ...
Mike R wrote: | | It seems more efficient to store an IDENTITY
value, or some other unique value such as a GUID or hash of the 3
pieces of identifying information (though a hash can, by its very
nature, collide...), In this case, it seems better just to go with
IDENTITY values. Am I wrong? |
|
Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.
The nearest "cousin" to the relational model is CODASYL (aka
"network database"), with the most popular platform being IDMS. I think
Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL
database have a DBKEY, which was a system generated hash based on their
physical storage. Any record in the database could be retreived simply
by knowing it's DBKEY; yoou didn't even need to know it's type
(Customer, Product, etc). One of the huge advantages to this approach
is that it was rediculously fast. The reason we don't have CODASYL
databases anymore are because the of the inherent problems with
pointer-based models: data integrity, pointer-chains, etc.
When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a
relational database, you are essentially playing CODASYL. Worse, you're
not using a CODASYL database to do do it, so you have to try to
reinvent all the features they stuck in CODASYL databases to make
pointer-based models work. As the system grows, it gets horribly ugly
and the data begins to corode. And when it comes time for HIPPA/SOX/etc
audits, whooo-boy: you don't even want to know how many
hundreds-of-dollars an hour and how many hours it takes for data
consultants charge to fix your system and bring it to complaince.
One of my favorite "massive financial loss" stories comes from a
procedures system developed by a reputable consulting companies (for
<$100k) that used the "ID everwhere" model (they obviously didn't
bring in any data specialists). When the company failed an ISO audit,
they had to pay over $500,000 to fix the data in this system. The data
corrosion came from an incredibly tiny bug in the procedural code that,
had they done the database right, would have only caused an error
message to popup instead of data corruption.
Whenever I see people advocating for pointer-models, I think back to
Santayana: Those who cannot remember the past are condemned to repeat
it. |
|
But how does using an indentity column or other form of meaningless
primary key on a table lead to data corruption? I'm just not
seeing it. The only thing in my admittedly limited experience
that I have seen cause problems is when someone decides
not to use a meaningless primary key. That has
inevitably led to severe problems because whatever meaningful bit of
data they chose for the primary key needed to be changed in some way
later on in the life of the project. Granted, just because a few
yahoos before me couldn't pick a proper, meaningful primary key doesn't mean that
meaningful primary keys are bad, and I'm not trying to imply
that. But I just can't see how meaningless primary keys, by
their very nature, lead to data corruption.
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
UncleMidriff wrote: | I just can't see how meaningless primary keys, by their very nature, lead to data corruption.
|
|
They realized long ago that, no matter how good the programmers are, you still will end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/complaince data), you have serious problems.
One of the biggest innovations with the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS was responsbile for enforcing business rules on whether or not data is valid: everything from birthdays occuring at least eighteen years ago to orders having existing products.
When you use pointers to relate data instead of the data iteself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplist example comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have DI at the database:
Everytime a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).
Procedures had user-maintinable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.
If the user wanted to rename a KOV, the system would check if any procedures used the KOV and, if so, create a new KOV and inactive the old one, thereby ensuring all old KOV remain in the system. However, there was a bug in this maintence module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this but none of them caught the peculiar way of changing the KOV.
After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way the could correct the problem (and pass ISO) was to take three years worth of tape backups, paper copies, etc, hire an army of temps and try to reconcile the thousands upon thousands of procedures.
Had they not used an ID pointer, the trigger would rejected the update before it snowballed into half a million. This is pretty much how it usuaully happens: a minute bug in the Middle tier slowly corrupting historical data.
All in all, if you use these non-relational techniques, you end up trying to have to put in gobs of work-around code (the bandaid for this problem was to put in a trigger that queried the meta-tables and used dynamic sql to check what tables used that the value) which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.
Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.
|
|
-
-
bmschkerke


- Joined on 07-06-2005
- Posts 16
|
Re: Primary Key/Clustered Index
I don't need an identity column, but didn't have a suitable primary key
replacement. Duplicate addresses are to be accepted, if
entered. We have an entity that has up to five addresses.
Rather create columns of StreetAddress1, StreetAddress2, City, State,
ZipCode five times I moved it into a separate table. Since each
entity can have multiple addresses I cannot use the entity table's
primary key as an address primary key. We deal primarily with
individuals and so don't track DUN numbers.
The customer number and sequence would work, possibly. However, I
have linked this table to other tables, not all of which share a
similar logical structure. (Not all addresses are related to
customers, for example.)
Should each table have a separate Addresses table, so that it can be
managed using a parent table's keys? Or denormalize it if they're
so closely linked, even if the information might often be null for the
optional addresses?
|
|
-
-
UncleMidriff


- Joined on 12-17-2004
- Oklahoma City, Oklahoma
- Posts 211
|
Re: Primary Key/Clustered Index
Alex Papadimoulis wrote: | UncleMidriff wrote: | I just can't see how meaningless primary keys, by their very nature, lead to data corruption.
|
|
They realized long ago that, no matter how good the programmers are,
you still will end up with defects in procedural "middle-tier" code. If
it's the occasional error message or crash, it's rarely a big deal.
When the defect causes data corruption (especially
historical/complaince data), you have serious problems.
One of the biggest innovations with the RDBMS was that data
integrity is handled in the database. Instead of relying on the middle
tier, the RDBMS was responsbile for enforcing business rules on whether
or not data is valid: everything from birthdays occuring at least
eighteen years ago to orders having existing products.
When you use pointers to relate data instead of the data iteself,
you lose the ability to maintain referential integrity. It's hard to
see it in smaller systems. One of the simplist example comes from the
case I mentioned (the $500k bug) and was a direct result of the
inability to have DI at the database:
Everytime a procedure was changed, a new revision of the procedure
was created and the old revision became inactive. Inactive procedures
could not be changed and a trigger was used to enforce this ( IF
EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).
Procedures had user-maintinable discrete domains (picklists,
dropdowns) of various things. For example, Key_Output_Variables := (Id
int, Variable varchar(15), Active_Ind char(1)). The procedure table
referenced a KOV with a foreign key constraint.
If the user wanted to rename a KOV, the system would check if any
procedures used the KOV and, if so, create a new KOV and inactive the
old one, thereby ensuring all old KOV remain in the system. However,
there was a bug in this maintence module that caused it to update the
KOV anyway (in some cases, if the user did it in a certain way). There
were a handful of test cases to test this but none of them caught the
peculiar way of changing the KOV.
After five years of being in production, auditors discovered this by
comparing paper and electronic copies of inactive revisions. The only
way the could correct the problem (and pass ISO) was to take three
years worth of tape backups, paper copies, etc, hire an army of temps
and try to reconcile the thousands upon thousands of procedures.
Had they not used an ID pointer, the trigger would rejected the
update before it snowballed into half a million. This is pretty much
how it usuaully happens: a minute bug in the Middle tier slowly
corrupting historical data.
All in all, if you use these non-relational techniques, you end up
trying to have to put in gobs of work-around code (the bandaid for this
problem was to put in a trigger that queried the meta-tables and used
dynamic sql to check what tables used that the value) which (since it's
lots of code) will be inherently prone to defects. But the true
nightmare begins when you try to build a data warehouse.
Again, you see these problems in large systems. But, a lot of times, systems start out small and become large. |
|
So, in this system, the values in the Variable column of the KOV table are supposed to
be unique, and thus it could have been used as a primary key. Also,
the users can change the value in the Variable column of a given row
whenever the heck they please. Provided that these assumptions
are correct, here's what I would have done:
Since the value of Variable for a given KOV is expected to change, and
since I wouldn't want to change every row in every table that
referenced a particualr KOV everytime someone changed the value of Variable for that
KOV, I would have used an Id column as the primary key. Then I
would have added a trigger that would whine if someone tried to add or
change a KOV with the same Variable value as an existing KOV. To
keep a history of changes, I would have added another table, named
something like "KOVChangeLog," in which I would store the Id and old
Variable values of the changed KOV, as well as the user who changed it
and the time at which it was changed.
When someone changes a KOV, the system would check to see if any
procedures reference that KOV. If so, the KOVChangeLog table
would have added to it the Id and old Variable value of that KOV along
with the user responsible for and the time of the change, and then that
KOV would be updated. If not, that KOV would just be
updated. The trigger would ensure that the user wouldn't be able
to change that KOV to a KOV that already exists, and the meaningless Id
column used as the primary key would ensure that every row in every
table referencing that KOV wouldn't have to be updated everytime that
KOV might be changed.
I apologize if the above is a huge WTF for any reason, and I offer the excuse that I haven't been at this very long.
Another question: how is the use of pointers "non-relational"
according to the Information Rule? I'm assuming here that you are
talking about Codd's 12 Rules.
Wikipedia.org wrote: |
The information rule:
All information in the database to be represented in one and only
one way, namely by values in column positions within rows of tables.
|
|
Perhaps I'm just dense, but I don't see how using meaningless primary keys violates this rule.
|
|
-
-
Alex Papadimoulis


- Joined on 10-16-2004
- Cleveland, OH
- Posts 2,042
|
Re: Primary Key/Clustered Index
bmschkerke wrote: | The customer number and sequence would work, possibly. However, I have linked this table to other tables, not all of which share a similar logical structure. (Not all addresses are related to customers, for example.)
Should each table have a separate Addresses table, so that it can be managed using a parent table's keys? Or denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?
|
|
You're on the right track by putting it in a (single) addresses table. Two things to think about are orphaned addresses (an address without a Customer, Vendor, etc) and shared addresses (an address shared between a Customer and a Vendor). Both may or may not be problematic depending on your problem domain.
|
|
-
|
|