The Daily WTF: Curious Perversions in Information Technology

2011-06-30 Reply Admin

And I did not say guarantee, I said Statistically guarantee. To guarantee you need 366, to statistically guarantee you need 55.

Well actually you said "statistically garuntee" and 27 rather than 55, but lets not split hairs. You then gave a completely wrong explanation of the probability of it. You can't add probabilities in the way you did unless the events are mutually exclusive. By your reasoning, if you throw a die 6 times then you're guaranteed to get a 6.

The correct way to work it out is to consider the probability of not getting a match.

Let's leave aside leap years for a minute.

With 1 person in a room the probability of not having two sharing a birthday is 1.

With a 2nd person in the room, the probability is 1 * 364/365.

With a 3rd person in the room, the probability is 1 * 364/365 * 363/365.

And so on. The more people, the smaller the probability of not getting a match gets.

To find the probability of getting a match you simply subtract from 1, and has someone else already pointed out, the probability of getting a match exceeds 50% at around the 23 person mark.

If you go on up to 27 people then the probability of a match rises to about 63%, but still a long way from a guarantee. On your higher figure of 55 you do achieve a 98.6% probability, which is probably a much better way of expressing it.

HTH John

2011-06-30 Reply Admin

John WInters:
Just as a birthday is 1 in 365, but it takes only about 55 people to statiscally garuntee a match.

Just doubling the number doesn't make it correct.

To guarantee a match, you need 366 people.

Do please go and read that wikipedia page on the topic to which you were referred earlier. You're recollection of the problem is faulty, and your reasoning defective.

Actually you'd need 367 people to guarantee a match. Remember leap year.

When the originally poster said "statistically guarantee", I presume he meant to have such a high probability of a match that you could, for practical purposes, rely on it happening. Like statisticians talk about a 95% or 99% confidence interval.

2011-06-30 Reply Admin

IMHO, using natural keys as primary keys has a number of advantages. Like:

They can eliminate joins. Like the person discussing currency code mentioned, if a large percentage of the time the only thing you want to know about a record is it's natural key, then using the natural key as a foreign key eliminates the need to do the join at all. We don't need to look up the record by surrogate key and then retrieve the natural key if we already have the natural key.
It makes ad hoc database queries simpler and easier to understand. I can dump a record and immediately see what related records it is pointing to, rather than having to write joins to do this.
It reduces the number of indexes. We almost surely have to index on the natural key. If we have a surrogate key for the primary key, then we also have to index on that, and thus we have one more key per table. Extra keys reduce performance.
It may simplify code. If we want to do a bunch of processing related to store X, for example, with a surrogate key we would have to first look up store X to get its primary key, cache this, and then do the look ups of the related records. With a natural key, the input value is the primary key and we eliminate that step.

None of these advantages are earth-shattering, of course, But then none of the advantages of surrogate keys are earth-shattering either.

Let me hasten to add that there are plenty of really bad examples of attempts to use a natural key. Like my earlier post about last name plus birth date plus zip code. But the fact that something can be used badly doesn't mean it can't be used intelligently. There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries. That doesn't prove that lawnmowers a bad thing that should be banned. It just proves that stupid people can always find a way to misuse things that are otherwise useful.

2011-06-30 Reply Admin

Databastard:
I worked for a large nationwide retail chain once. Primary key for the store table was store number -- naturally. In the inventory table, which told us how many of each item was on hand in each store, there was a foreign key to store number. ... The first digit of the store number identified which distribution center it got its merchandise from. Oops!! Danger! Danger! There is meaningful information stored in the key!!! ... All because somebody was ignorant of the basic idea that a key should never hold any meaningful information.

I agree with you 100% that a key should not include any "incidental" data about the entity. A key should identify the entity, period.

This is really a special case of the general database design principle, "One fact per field." Identifying the record is one fact. Identifying the region or distribution center or hair color or whatever is another fact. It should be another field.

That said, this really has little to do with the fact that they used a natural key instead of a surrogate key. It was a badly designed natural key, true. But I'm presently working on a system that uses surrogate keys, but those keys are all server number that wrote the record plus a sequence number. It was done because the original developers of the system didn't believe in using autonumbering or sequences, and so they had to keep a separate set of sequence numbers for each server. (Why they didn't believe in sequences is another story.) Occassionally someone will say how this is a great idea because when we're investigating production problems sometimes it's useful to know whether two records were written by the same server or different servers. I say, if that's useful information, we should have a field for it and record it cleanly, not cram it into the primary key.

My point is, yes, people can cram extraneous information into a natural key. People can cram extraneos information into a surrogate key. Either idea is bad.

I might add as a footnote that as, by definition, the system designer does not control the format and content of natural keys, there may be information embedded in there without your knowledge or consent. If this is information that you don't care about in your system, then it's a non-issue. Like, Social Security Number has a region code embedded in it, the region from which the SSN was issued. But most systems don't care where the customer or employee or patient or whomever got his SSN, so the fact that this is embedded is irrelevant.

2011-06-30 Reply Admin

Jay:
There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries.

Really?

frits · 2011-06-30 Reply Admin

trtrwtf:
Jay:
There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries.

Really?

No. A straw man would obviously use a baler.

2011-06-30 Reply Admin

Jay:

Did you ever notice that statements that start out, "I am not X ...", very often end up with a statement that pretty much identifies the speaker as X by definition? Like, "I'm not a racist, I just think that all black people are ...", or "I'm not a censor, I just think that books like that should not be allowed to be printed."

I am not a person who replies to comments, but...

hoodaticus · 2011-06-30 Reply Admin

Ollie Jones:
This is normal. All production dbms instances look like buried utility lines in a city.
In enterprise software, we need 1-800-dig-safe to come and spray orange paint on stuff before we change it.

Seriously.

QFT

hoodaticus · 2011-06-30 Reply Admin

frits:
trtrwtf:
Jay:
There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries.

Really?

No. A straw man would obviously use a baler.

Too witty not to quote. Very nice.

2011-06-30 Reply Admin

My company recently upgraded our main system's server. And yep it caused all sorts of problems with the overnight batch work. On the old server everything worked fine, with jobs running happily together. But on the new server all hell broke lose as the jobs completed much more quickly than before and started other jobs which had never run alongside each other before and started overwriting data files, locking tables, deadlocking etc.

Yes the schedule was setup incorrectly in the first place, but it didn't use to matter because things went slowly enough that the runs never clashed anyway.

2011-06-30 Reply Admin

itsmo:
Kazan:
with the lot:
Social Security Number? Bingo!!

SSN is not usable as a key, you can have collisions. there are two different guys in iowa with the same SSN. as noted above i helped reimplement software that was implemented based on the same flawed assumption you just made.

WTF - how do the IRS deal with that one? No I don't owe you $1,000,000 - it's that other Iowan...

At the risk of having been beaten to the punch: The IRS e-file system at least uses SSN+Name+Age for uniquely identifying people. This isn't theoretically perfect, but it's close enough.

SSN's are actually partially location-based, depending on where you were born. So it actually becomes more likely for two people to have the same SSN, thus an SSN by itself is not guaranteed to be unique for arbitrarily large groups of people. For two people to be born in the same place, and have the same age, and have the same name, and get the same SSN, is theoretically possible, but incredibly unlikely to happen. (Somewhere upwards of 10,000 people would need to be born in that place that year, and then the overlapping SSN would need to have the same-named person...)

Captcha: paratus. The non-AP version of the AP Paratus course.

2011-06-30 Reply Admin

K:
But then I want to ask something more serious: Why would anyone ever not want a unique, autoincremented integer to be the primary key? I learned in uni that there are many ways to do PKs, but all I have ever implemented was an extra integer with no other meaning than "this is the PK".

You have to sync up the new records that were added in Staging to Production. How are you sure that the same records are there or not when you have some of the same numbers and some different ones?

With organic keys, it's dead simple. But with AutoIncrement meaningless keys, you're going to be a while.

BTW, did I mention that the table has 100 columns and 1000000 rows?

2011-07-01 Reply Admin

boog:
C-Octothorpe:
I guess the sum of all your human attributes may identify you uniquely...
Is addition really applicable? What if two distinct combinations of human attributes add up to the same value? Collisions seem almost guaranteed to occur at some point.

s/sum/cartesian product/

GFK · 2011-07-01 Reply Admin

boog:
nonpartisan:
KattMan:
Pedant:
<quote>To guarantee a match, you need 366 people.</quote>
I think you'll find you need 367 people.

And I did not say guarantee, I said Statistically guarantee. To guarantee you need 366, to statistically guarantee you need 55.

That is an important difference.

There are some years that have 366 days. Ergo, to guarantee a match you need 367 people . . .
Yes, but those years of 366 days only occur 25% of the time*, so technically you need 366.25 people to guarantee a match.

Yes, I know it's not exactly once every 4 years, but let's approximate just for the sake of this silly comment.

The statistical distribution of each existing calendar day amongst the population doesn't matter because you need exactly one of each existing possibility, + 1. That's 367.

2011-07-01 Reply Admin

Jay:
IMHO, using natural keys as primary keys has a number of advantages. Like:

They can eliminate joins. Like the person discussing currency code mentioned, if a large percentage of the time the only thing you want to know about a record is it's natural key, then using the natural key as a foreign key eliminates the need to do the join at all. We don't need to look up the record by surrogate key and then retrieve the natural key if we already have the natural key.

It makes ad hoc database queries simpler and easier to understand. I can dump a record and immediately see what related records it is pointing to, rather than having to write joins to do this.

It reduces the number of indexes. We almost surely have to index on the natural key. If we have a surrogate key for the primary key, then we also have to index on that, and thus we have one more key per table. Extra keys reduce performance.

It may simplify code. If we want to do a bunch of processing related to store X, for example, with a surrogate key we would have to first look up store X to get its primary key, cache this, and then do the look ups of the related records. With a natural key, the input value is the primary key and we eliminate that step.

None of these advantages are earth-shattering, of course, But then none of the advantages of surrogate keys are earth-shattering either.

Let me hasten to add that there are plenty of really bad examples of attempts to use a natural key. Like my earlier post about last name plus birth date plus zip code. But the fact that something can be used badly doesn't mean it can't be used intelligently. There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries. That doesn't prove that lawnmowers a bad thing that should be banned. It just proves that stupid people can always find a way to misuse things that are otherwise useful.

Just a quick reply.

ad 1. I did mention lookup tables. Using natural keys there can reduce the complexity of queries.

ad 2. Depends on the values inside the natural PK column(s).

ad 3. If you have an entity that has both surrogate PK and a unique natural key, indexing the latter one will help. Having an entity, where theres no unique natural key (aside from declaring the whole row as natural key), indexing may only be necessary for speeding up searches.

ad 4. Problem with your example: What if the input value ist wrong? You still need a search function, regardless if you're using natural key or surrogate key.

As for your final statement: Of course there are advantages for both sides (surrogate/synthethic vs. natural key). However, for data models with several levels of master <-> detail relationships, the natural key approach quickly degenerates into cluttered sql queries, because you have to add composite primary keys along the way, at least in my experience. And therein lies my question: Why make your life hard, when you can avoid that?

l.

2011-07-01 Reply Admin

PRMan:
K:
But then I want to ask something more serious: Why would anyone ever not want a unique, autoincremented integer to be the primary key? I learned in uni that there are many ways to do PKs, but all I have ever implemented was an extra integer with no other meaning than "this is the PK".
You have to sync up the new records that were added in Staging to Production. How are you sure that the same records are there or not when you have some of the same numbers and some different ones?
With organic keys, it's dead simple. But with AutoIncrement meaningless keys, you're going to be a while.

BTW, did I mention that the table has 100 columns and 1000000 rows?

Do you mean sync or merge? Syncing is easy, just set different number ranges in the sequences, for example 1000000 - 5000000 for production and 5000000 - 9999999 for staging.

l.

frits · 2011-07-01 Reply Admin

hoodaticus:
frits:
trtrwtf:
Jay:
There are people out there who will try to cut their hair with a lawnmower and then sue the manufacturer for their injuries.

Really?

No. A straw man would obviously use a baler.

Too witty not to quote. Very nice.

Thanks. Welcome back.

boog · 2011-07-01 Reply Admin

GFK:
boog:
Yes, but those years of 366 days only occur 25% of the time*, so technically you need 366.25 people to guarantee a match.

Yes, I know it's not exactly once every 4 years, but let's approximate just for the sake of this silly comment.
The statistical distribution of each existing calendar day amongst the population doesn't matter because you need exactly one of each existing possibility, + 1. That's 367.

No, I'm pretty sure that 366.25 would still guarantee a match.

Don't believe me? Go find 366.25 people with distinct birthdays and prove me wrong then.

2011-07-01 Reply Admin

I go for surrogate keys all the time. Not because I give a shit in the holy war, but for consistency and for ORM ease of use. Everything's the same and you know what to expect coding against any table. The ORM can find it's feet easily. You are only then solving one set of issues that way too.

2011-07-02 Reply Admin

LANMind:
BentFranklin:
Unique, auto-numbered, never null... Shouldn't TRXID have been the primary key all along?

Good Job. You got the exact point of the story.

ITYM "Good boy! Have a Scooby Snack".

2011-07-02 Reply Admin

yorch Bardi:
Meep:
Clearly the natural key to use is the compound key AID and BID.

And you use a CID as PK right?

I'm assuming this was a joke. But in the probably forlorn hope that this site has the potential to actually educate some people on "what not to do":

Meep was clearly saying that (AID,BID) would be the (compound) primary key for table C. Creating a "CID" auto-number column in such cases is insane and give no meaningful benefit.

I assume the reason its so disturbingly common is either because (a) so many developers have been indoctrinated into thinking that all tables must have an auto-number PK or (b) it makes it easy for them to use boiler-plate code and/or crappy/homegrown ORM and databinding layers that make the simplistic assumption that every table has a single auto-number PK column.

2011-07-02 Reply Admin

callcopse:
I go for surrogate keys all the time. Not because I give a shit in the holy war, but for consistency and for ORM ease of use. Everything's the same and you know what to expect coding against any table. The ORM can find it's feet easily. You are only then solving one set of issues that way too.

Told you. Muppets.

2011-07-02 Reply Admin

[quote user="QJo"][quote user="Pedant"][quote user="A Gould"] Ideally (and let's emphasize the "ideal" here), your Primary Key should be how your company tells the various entities apart. An arbitrary key that doesn't relate back to the business risks duplication.[/quote] [/quote]

Forget the dbms, primary keys, indexes, uniqueness constraints and the like for a minute. Just think about the business process: how will the user identify the correct record?

Is there a simple (one or a few attributes) characteristic of the entity which is guaranteed to uniquely identify it? If so, that is the Business Identifier. If not, how will they distinguish entities -- explore all the facts held about the entity until a distinguishing feature is found? Or will you have to introduce an artificial business identifier (customer id or the like) to avoid ambiguity.

If your users will use the "explore around" approach, think about data protection issues too ("Are you the Bob Smith from Ohio who called last week about the order for 1,000 tubes of KY?" "No, that's the other Bob Smith" .... later in the bar: "Guess what the other Bob Smith has been buying...").

If you introduce an artificial business identifier then consider data entry/validation issues (e.g. make sure its not too long to give over the phone, avoid mixed case, give it a check-digit or similar so that typos/bad line errors are caught).

Now you can start to think about the system design. If there is a unique business identifier, it needs a unique constraint (enforced via an index). If using an artificial business identifier then it needs a generator -- note that its unlikely that auto-number or GUID/UUID generators in the DBMS will produce suitable numbers given the considerations above - you may need a UDF or some other stored code to create an appropriate identifier.

Now, if there are foreign key references to the table, AND the columns in the business identifier are either large (long strings) or complex (many columns) or unstable (liable to change), then it may not be appropriate to use the business identifier as the PK in the database. In that case, introduce a surrogate PK (typically an auto-number field). Retain the uniqueness constraint on the business identifier -- both to ensure correctness and also because its probably still the way that the entity will be looked up, so we still want an index on it.

The only remaining arguments are:

i) what is "too many columns" -- some people seem to think "2" is too many, presumably because they aren't using natural join or a decent ORM and don't like typing ON (... AND ... AND ...). This thinking results in less efficient schema (think about navigating from "a detail table of a detail table" to the top-level table).

ii) what is "too unstable" -- some people think that if there's any chance that someone somewhere will require a field to be changed then it is "unstable" and shouldn't be part of the PK. Yes, any time a field is exposed to a user there's a chance some manager with budget/clout will insist on it being changed. But how likely are those changes and how much pain is it really to let ON UPDATE CASCADE do its job? Remember, YAGNI.

iii) what is "too long" -- some people seem to think that any string is inappropriate, resulting in silly (country_code_id, country_code, country_name) tables where a 4-byte integer is used to look up a 2-3 character country code.

Its really not that hard, people.

2011-07-02 Reply Admin

Jay:
When the originally poster said "statistically guarantee", I presume he meant to have such a high probability of a match that you could, for practical purposes, rely on it happening. Like statisticians talk about a 95% or 99% confidence interval.

Another way to look at it is the expectation of the number of matches. With N people, there are N(N-1)/2 "pairs", of which on average 1/365 (ignoring leapyears) are matches. At 28 people, the expected (average) number of matches is just over 1.

(This probably isn't a particularly intuitive way of looking at it, though -- it treats finding 2 or 3 or more matches as "more valuable" than finding just 1, and I suspect that's not how people think about this problem normally.)

2011-07-03 Reply Admin

The real WTF would be an Oracle database that DIDN'T look like this.

2011-07-05 Reply Admin

Jack:
some orders had the same item on two lines meant that the transaction exploded
Yeah, but why would anyone wanna do that? Stupid users. See, it was their fault after all, not the upgrade.

At a guess, that would be double-clicking. That including the timestamp in the primary key was intended to prevent (while in general allowing the same item on two lines).

2011-07-08 Reply Admin

BentFranklin:
Unique, auto-numbered, never null... Shouldn't TRXID have been the primary key all along?

Um no, making TRXID the PK would allow insertions that the old PK obviously would have seen as duplicates. They do seem to be using a PK where they should use a unique constraint though.

2011-07-12 Reply Admin

............obvious answer is obvious.........

tgape · 2011-08-12 Reply Admin

QJo:
C-Octothorpe:
AndrewH:
"Well," Rick sighed, "the user was right."

Not really. The upgrade just made the error more likely to occur; it exposed the problem, it didn't cause it.

Where I come from, if you implement a change and it has side effects, even if they're impossible to plan for, or even because another application is incorrectly pointing at your web service even though they knew it was getting decommissioned like 3 years ago, it's your fault. Doesn't matter that the other app owner didn't fix it correctly, it's still your fault because your change caused something to break.

Now tell me that isn't screwy...

Not screwy at all. There's a bug in the system which is either caused by, or has been uncovered by, activity which you have perpetrated. Therefore it is up to you to find out what its cause is. Even when you find that the underlying cause of the bug was not your direct fault, it is perfectly reasonable to request that you take the responsibility for ensuring that it be fixed.

However, if your change uncovers the bug before it gets to production, that's something else altogether. Then you are fully entitled to jump up and down and crow like a cockerel that you've found another of Billy Bodger's billion bugs and get lots more pats on the back from your admiring co-workers.

The bug was uncovered before their upgrade went to production. I have two points of evidence:

The replacement PK field was already added, it just wasn't used.
The admin was already trained to ignore the error the condition generated.

This shows it was an already known bug, but apparently not one considered worth the time to fix. Probably a mismanagement decision. (It could, of course, be that the person who was tasked with fixing it saw all the places that would need updating, and decided to not do it. But I suspect it's more likely they didn't follow proper change process when performing their ALTER TABLE, and were fired over the resulting downtime. Not saying that's likely - just more likely than the person who went as far as altering the table was the one who decided to not finish the fix.)

That having been said, to defend yourself from the blame, you need to be quick about this - it was clearly too late before this was submitted. Also, it really helps to have your test include adding a bunch of new records, too, rather than just performing a read/update-only test. (I suspect it was doing a read/update-only test because even before the memory upgrade, a programmatic stream of creates would trigger the bug - and by this point that bug was just being dismissed as an annoyance.)

2011-10-14 Reply Admin

that's why this is posted on the daily WTF instead of Mundane IT Success Stories Network.

The Key to a Good Schema

Leave a comment on “The Key to a Good Schema”