Admin
Well, actually you said "statistically garuntee" and 27 rather than 55, but let's not split hairs. You then gave a completely wrong explanation of the probability. You can't add probabilities the way you did unless the events are mutually exclusive. By your reasoning, if you throw a die 6 times you're guaranteed to get a 6.
The correct way to work it out is to consider the probability of not getting a match.
Let's leave aside leap years for a minute.
With 1 person in a room the probability of not having two sharing a birthday is 1.
With a 2nd person in the room, the probability is 1 * 364/365.
With a 3rd person in the room, the probability is 1 * 364/365 * 363/365.
And so on. The more people, the smaller the probability of not getting a match gets.
To find the probability of getting a match you simply subtract from 1, and as someone else has already pointed out, the probability of getting a match exceeds 50% at around the 23-person mark.
If you go on up to 27 people then the probability of a match rises to about 63%, but still a long way from a guarantee. On your higher figure of 55 you do achieve a 98.6% probability, which is probably a much better way of expressing it.
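For anyone who wants to check the arithmetic, here's a quick sketch of that calculation (my own illustration, ignoring leap years as above):

```python
from math import prod

def p_shared_birthday(n: int) -> float:
    """Probability that at least two of n people share a birthday (365-day year)."""
    # Probability that all n birthdays are distinct, built up person by person:
    # 1 * 364/365 * 363/365 * ...
    p_distinct = prod((365 - k) / 365 for k in range(n))
    return 1 - p_distinct

print(round(p_shared_birthday(23), 3))  # 0.507 -- first crosses the 50% mark
print(round(p_shared_birthday(27), 3))  # 0.627 -- about 63%, not a guarantee
print(round(p_shared_birthday(55), 3))  # 0.986
```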
HTH John
Admin
Actually you'd need 367 people to guarantee a match. Remember leap year.
When the original poster said "statistically guarantee", I presume he meant to have such a high probability of a match that you could, for practical purposes, rely on it happening. Like statisticians talk about a 95% or 99% confidence interval.
Admin
IMHO, using natural keys as primary keys has a number of advantages. Like:
They can eliminate joins. As the person discussing currency codes mentioned, if a large percentage of the time the only thing you want to know about a record is its natural key, then using the natural key as a foreign key eliminates the need to do the join at all. We don't need to look up the record by surrogate key and then retrieve the natural key if we already have the natural key.
It makes ad hoc database queries simpler and easier to understand. I can dump a record and immediately see what related records it is pointing to, rather than having to write joins to do this.
It reduces the number of indexes. We almost surely have to index on the natural key. If we use a surrogate key for the primary key, then we also have to index on that, and thus we have one more index per table. Extra indexes hurt write performance.
It may simplify code. If we want to do a bunch of processing related to store X, for example, with a surrogate key we would have to first look up store X to get its primary key, cache it, and then do the lookups of the related records. With a natural key, the input value is the primary key and we eliminate that step.
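The first point can be shown with a toy schema (hypothetical table and column names, SQLite for brevity): when the ISO currency code itself is the key, the child row already carries the value you usually want, so no join is needed to display it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The natural key (ISO currency code) is the primary key, so child rows
# carry the code directly and no join is needed just to show the code.
cur.execute("CREATE TABLE currency (code TEXT PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE invoice (
    id INTEGER PRIMARY KEY,
    amount REAL,
    currency_code TEXT REFERENCES currency(code)
)""")
cur.execute("INSERT INTO currency VALUES ('USD', 'US Dollar'), ('EUR', 'Euro')")
cur.execute("INSERT INTO invoice (amount, currency_code) VALUES (99.5, 'EUR')")

# The currency code is already on the invoice row -- no join required:
row = cur.execute("SELECT amount, currency_code FROM invoice").fetchone()
print(row)  # (99.5, 'EUR')
```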
None of these advantages are earth-shattering, of course, but then none of the advantages of surrogate keys are earth-shattering either.
Let me hasten to add that there are plenty of really bad examples of attempts to use a natural key. Like my earlier post about last name plus birth date plus zip code. But the fact that something can be used badly doesn't mean it can't be used intelligently. There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries. That doesn't prove that lawnmowers are a bad thing that should be banned. It just proves that stupid people can always find a way to misuse things that are otherwise useful.
Admin
I agree with you 100% that a key should not include any "incidental" data about the entity. A key should identify the entity, period.
This is really a special case of the general database design principle, "One fact per field." Identifying the record is one fact. Identifying the region or distribution center or hair color or whatever is another fact. It should be another field.
That said, this really has little to do with the fact that they used a natural key instead of a surrogate key. It was a badly designed natural key, true. But I'm presently working on a system that uses surrogate keys, and those keys are all the number of the server that wrote the record plus a sequence number. It was done because the original developers of the system didn't believe in using autonumbering or sequences, and so they had to keep a separate set of sequence numbers for each server. (Why they didn't believe in sequences is another story.) Occasionally someone will say how this is a great idea because when we're investigating production problems it's sometimes useful to know whether two records were written by the same server or different servers. I say, if that's useful information, we should have a field for it and record it cleanly, not cram it into the primary key.
My point is, yes, people can cram extraneous information into a natural key. People can cram extraneous information into a surrogate key. Either idea is bad.
I might add as a footnote that as, by definition, the system designer does not control the format and content of natural keys, there may be information embedded in there without your knowledge or consent. If this is information that you don't care about in your system, then it's a non-issue. Like, Social Security Number has a region code embedded in it, the region from which the SSN was issued. But most systems don't care where the customer or employee or patient or whomever got his SSN, so the fact that this is embedded is irrelevant.
Admin
Really?
Admin
No. A straw man would obviously use a baler.
Admin
I am not a person who replies to comments, but...
Admin
My company recently upgraded our main system's server. And yep, it caused all sorts of problems with the overnight batch work. On the old server everything worked fine, with jobs running happily together. But on the new server all hell broke loose as the jobs completed much more quickly than before and started other jobs which had never run alongside each other before, and started overwriting data files, locking tables, deadlocking, etc.
Yes, the schedule was set up incorrectly in the first place, but it never used to matter because things went slowly enough that the runs never clashed anyway.
Admin
At the risk of having been beaten to the punch: The IRS e-file system at least uses SSN+Name+Age for uniquely identifying people. This isn't theoretically perfect, but it's close enough.
SSNs are actually partially location-based, depending on where you were born. So for the combination of SSN, name, and age, collisions are possible, and an SSN by itself is not guaranteed to be unique for arbitrarily large groups of people. For two people to be born in the same place, have the same age, have the same name, and get the same SSN is theoretically possible, but incredibly unlikely to happen. (Somewhere upwards of 10,000 people would need to be born in that place that year, and then the overlapping SSN would need to belong to a same-named person...)
Captcha: paratus. The non-AP version of the AP Paratus course.
Admin
With organic keys, it's dead simple. But with AutoIncrement meaningless keys, you're going to be a while.
BTW, did I mention that the table has 100 columns and 1000000 rows?
Admin
s/sum/cartesian product/
Admin
The statistical distribution of each existing calendar day amongst the population doesn't matter, because to guarantee a match you need exactly one person for each existing possibility, plus one. That's 367.
Admin
Just a quick reply.
ad 1. I did mention lookup tables. Using natural keys there can reduce the complexity of queries.
ad 2. Depends on the values inside the natural PK column(s).
ad 3. If you have an entity that has both a surrogate PK and a unique natural key, indexing the latter will help. For an entity where there's no unique natural key (aside from declaring the whole row the natural key), indexing may only be necessary for speeding up searches.
ad 4. Problem with your example: What if the input value is wrong? You still need a search function, regardless of whether you're using a natural key or a surrogate key.
As for your final statement: Of course there are advantages on both sides (surrogate/synthetic vs. natural key). However, for data models with several levels of master <-> detail relationships, the natural key approach quickly degenerates into cluttered SQL queries, because you have to add composite primary keys along the way, at least in my experience. And therein lies my question: Why make your life hard when you can avoid it?
l.
Admin
Do you mean sync or merge? Syncing is easy, just set non-overlapping number ranges in the sequences, for example 1000000 - 4999999 for production and 5000000 - 9999999 for staging.
l.
Admin
Thanks. Welcome back.
Admin
Don't believe me? Go find 366.25 people with distinct birthdays and prove me wrong then.
Admin
I go for surrogate keys all the time. Not because I give a shit in the holy war, but for consistency and for ORM ease of use. Everything's the same and you know what to expect coding against any table. The ORM can find its feet easily. You're then only solving one set of issues that way, too.
Admin
ITYM "Good boy! Have a Scooby Snack".
Admin
I'm assuming this was a joke. But in the probably forlorn hope that this site has the potential to actually educate some people on "what not to do":
Meep was clearly saying that (AID,BID) would be the (compound) primary key for table C. Creating a "CID" auto-number column in such cases is insane and gives no meaningful benefit.
I assume the reason it's so disturbingly common is either because (a) so many developers have been indoctrinated into thinking that all tables must have an auto-number PK, or (b) it makes it easy for them to use boiler-plate code and/or crappy homegrown ORM and databinding layers that make the simplistic assumption that every table has a single auto-number PK column.
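A minimal sketch of the compound-key version (hypothetical table names, SQLite syntax): the pair (aid, bid) is the primary key of the junction table, so duplicate pairs are rejected without any "cid" column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Junction table c: the pair (aid, bid) IS the primary key -- no "cid" column.
cur.execute("CREATE TABLE a (aid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE b (bid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE c (
    aid INTEGER REFERENCES a(aid),
    bid INTEGER REFERENCES b(bid),
    PRIMARY KEY (aid, bid)
)""")

cur.execute("INSERT INTO c VALUES (1, 2)")
duplicate_rejected = False
try:
    cur.execute("INSERT INTO c VALUES (1, 2)")  # same pair again
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the compound PK does the de-duplication itself
print(duplicate_rejected)  # True
```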
Admin
Told you. Muppets.
Admin
[quote user="QJo"][quote user="Pedant"][quote user="A Gould"] Ideally (and let's emphasize the "ideal" here), your Primary Key should be how your company tells the various entities apart. An arbitrary key that doesn't relate back to the business risks duplication.[/quote][/quote][/quote]
Forget the dbms, primary keys, indexes, uniqueness constraints and the like for a minute. Just think about the business process: how will the user identify the correct record?
Is there a simple (one or a few attributes) characteristic of the entity which is guaranteed to uniquely identify it? If so, that is the Business Identifier. If not, how will they distinguish entities -- explore all the facts held about the entity until a distinguishing feature is found? Or will you have to introduce an artificial business identifier (customer id or the like) to avoid ambiguity?
If your users will use the "explore around" approach, think about data protection issues too ("Are you the Bob Smith from Ohio who called last week about the order for 1,000 tubes of KY?" "No, that's the other Bob Smith" .... later in the bar: "Guess what the other Bob Smith has been buying...").
If you introduce an artificial business identifier then consider data entry/validation issues (e.g. make sure it's not too long to give over the phone, avoid mixed case, give it a check-digit or similar so that typos/bad line errors are caught).
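For the check-digit suggestion, one common scheme is the Luhn algorithm; here's a sketch (the exact scheme, and the example id, are illustrative choices, not something mandated by the discussion above):

```python
def luhn_check_digit(number: str) -> str:
    """Compute the Luhn check digit to append to a numeric identifier."""
    total = 0
    # Walk right to left; double every second digit, folding >9 back down.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 0:   # these positions get doubled once the check
            d *= 2       # digit is appended on the right
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

# e.g. id "7992739871" gets check digit "3", so you'd issue "79927398713";
# a single mistyped digit then fails the check and is caught at data entry.
print(luhn_check_digit("7992739871"))  # 3
```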
Now you can start to think about the system design. If there is a unique business identifier, it needs a unique constraint (enforced via an index). If using an artificial business identifier then it needs a generator -- note that it's unlikely that auto-number or GUID/UUID generators in the DBMS will produce suitable numbers given the considerations above - you may need a UDF or some other stored code to create an appropriate identifier.
Now, if there are foreign key references to the table, AND the columns in the business identifier are either large (long strings) or complex (many columns) or unstable (liable to change), then it may not be appropriate to use the business identifier as the PK in the database. In that case, introduce a surrogate PK (typically an auto-number field). Retain the uniqueness constraint on the business identifier -- both to ensure correctness and also because it's probably still the way that the entity will be looked up, so we still want an index on it.
The only remaining arguments are:
i) what is "too many columns" -- some people seem to think "2" is too many, presumably because they aren't using natural join or a decent ORM and don't like typing ON (... AND ... AND ...). This thinking results in less efficient schemas (think about navigating from a detail table of a detail table to the top-level table).
ii) what is "too unstable" -- some people think that if there's any chance that someone somewhere will require a field to be changed then it is "unstable" and shouldn't be part of the PK. Yes, any time a field is exposed to a user there's a chance some manager with budget/clout will insist on it being changed. But how likely are those changes and how much pain is it really to let ON UPDATE CASCADE do its job? Remember, YAGNI.
iii) what is "too long" -- some people seem to think that any string is inappropriate, resulting in silly (country_code_id, country_code, country_name) tables where a 4-byte integer is used to look up a 2-3 character country code.
It's really not that hard, people.
Admin
Another way to look at it is the expected number of matches. With N people there are N(N-1)/2 pairs, each of which is a match with probability 1/365 (ignoring leap years). At 28 people, the expected (average) number of matches is just over 1.
(This probably isn't a particularly intuitive way of looking at it, though -- it treats finding 2 or 3 or more matches as "more valuable" than finding just 1, and I suspect that's not how people think about this problem normally.)
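The expectation calculation above is a one-liner (my own illustration):

```python
def expected_matches(n: int) -> float:
    """Expected number of shared-birthday pairs among n people (365-day year)."""
    pairs = n * (n - 1) / 2   # number of distinct pairs of people
    return pairs / 365        # each pair matches with probability 1/365

print(round(expected_matches(28), 3))  # 1.036 -- just over one expected match
```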
Admin
The real WTF would be an Oracle database that DIDN'T look like this.
Admin
Um no, making TRXID the PK would allow insertions that the old PK obviously would have seen as duplicates. They do seem to be using a PK where they should use a unique constraint though.
Admin
............obvious answer is obvious.........
Admin
The bug was uncovered before their upgrade went to production. I have two points of evidence:
- The replacement PK field was already added; it just wasn't used.
- The admin was already trained to ignore the error that the condition generated.
This shows it was an already-known bug, but apparently not one considered worth the time to fix. Probably a mismanagement decision. (It could, of course, be that the person who was tasked with fixing it saw all the places that would need updating and decided not to do it. But I suspect it's more likely that they didn't follow proper change process when performing their ALTER TABLE and were fired over the resulting downtime. Not saying that's likely - just more likely than that the person who went as far as altering the table was also the one who decided not to finish the fix.)
That having been said, to defend yourself from the blame, you need to be quick about this - it was clearly too late before this was submitted. Also, it really helps to have your test include adding a bunch of new records, too, rather than just performing a read/update-only test. (I suspect it was doing a read/update-only test because even before the memory upgrade, a programmatic stream of creates would trigger the bug - and by this point that bug was just being dismissed as an annoyance.)
Admin
That's why this is posted on The Daily WTF instead of the Mundane IT Success Stories Network.