The Daily WTF: Curious Perversions in Information Technology

2019-10-28 Reply Admin

Sure, I see a natural WTF, but where's the surrogate WTF?

powerlord · 2019-10-28 Reply Admin

Is this article TRYING to bring up the natural vs. surrogate keys debate?

2019-10-28 Reply Admin

The natural answer would be “yes” which is quite uniquely defined. The surrogate answer would be a0533dea-c083-46d6-87a8-46c40f8f13e7-yes

DocMonster · 2019-10-28 Reply Admin

However, you can go too far with surrogate keys. In some cases, the natural key is enough. Like anything else, it needs to be done in moderation and blindly following "Everything needs an ID sequence as the primary key!" results in a lot of meaningless joins on meaningless integers. While you may have a States table that has an ID for some reason, it's stupid to reference the state a customer lives in by StateID and not the code; all this does is add an extra join for no reason.

Mr. TA · 2019-10-28 Reply Admin

Wrong, always use the integer identifier. Consistency is king. Joins to small lookup tables have almost no performance cost.

Steve_The_Cynic · 2019-10-28 Reply Admin

In some cases, the natural key is enough.

The only cases where the natural key is enough are the ones where the natural key's individual values will never change. Note: not the list, but the individual value. In the example of a States table, using the two-letter abbreviation as a (natural) key is more or less reasonable, since those abbreviations aren't going to change - we aren't going to wake up one morning to find that New York's abbreviation is no longer NY.

What generally bothers me is sites, usually American, that insist that I provide a state even though they know I'm in France, where the concept implied in US usage by a state is not applicable. (Really, guys. I live in {postal code} {name of town}, and that's all you need to know. In normal usage, nobody in France writes the name of their département, nor the name of their région(1), when providing an actual postal address.)

(1) For added amusement value, the names and composition of the régions changed a few years back. I no longer live in Nord-Pas-de-Calais, but in Hauts-de-France, but that change did not disturb the postal system because the name of your région is not part of your address.

2019-10-28 Reply Admin

Who said that the lookup table will be small? The discussion is about natural keys vs surrogate keys. When you declare that surrogate keys are "always" better, it also applies to large databases.

Let's pick an example: SKU codes for a large ecommerce website, like Amazon. According to you, there should be a surrogate key for every single product. Well, that's a terrible idea. Same thing for usernames in general and more importantly on large systems like Gmail. Even if NoSQL databases are used you can bet they're not relying on surrogate keys.

There's no such thing as "always" in software architecture.

Mr. TA · 2019-10-28 Reply Admin

Great point. Systems designed with "state code" as a key and foreign key reference are hard coded not to be compatible with any country that's not federated. Furthermore, even some federated countries don't include their region in the address like France, and even some who do may not use a 2 letter code, and lots of times it's not a state, but a province, like Canada. Therefore easy rule to follow - never use a natural key.

Mr. TA · 2019-10-28 Reply Admin

True, there's no always in anything in life. Systems like Amazon and Gmail need to be very inventive with horizontal scaling and many relational concepts go out the window. However, even in this example, either SKU is a surrogate key created by Amazon itself, so it's not natural, or it's a 3rd party code, which isn't guaranteed to be unique. With gmail, usernames can and do change and make for terrible keys. Try again.

Auction_God · 2019-10-28 Reply Admin

Amazon didn't use SKU as their surrogate key - they came up with their own "Amazon Id". Because...SKUs can change!

2019-10-28 Reply Admin

This article is a great example of why the bar to use a natural key should be thought of a "really high". The developer was obviously one who leaned towards natural kays and he figured that plate number and timestamp was "good enough". What he didn't realize was that "good enough" almost literally means "sometimes doesn't work". So, since the surrogate key always works, why go through the trouble of introducing even a small number of difficult records by trying to force a natural key - and in this case a natural key that happens to sometimes be surrogate! Also, the developer who decided to use random numbers really needs to look into the birthday paradox.

2019-10-28 Reply Admin

Database design is not about making things neat and comforting for database designers. It's about allowing for the best persistence and retrieval mechanism possible for a specific system.

For instance, usernames don't change often, which means that using a surrogate key to account for those rare events will add pressure on the system the rest of the time for useless lookups or joins - it is a bad design. The optimal solution in that case would be to use the natural key and build some form of data transformation workflow for the edge cases like changing the username.

In other cases, it makes sense to use a surrogate key, especially on small reference lists, like countries... but usually the entities in those situations already have some kind of accepted alias, like 12 for December or USA for United States. Whether "USA" constitutes a natural or surrogate key is up for debate.

Best practices are guidelines, not a prison sentence.

2019-10-28 Reply Admin

Respectfully, if joining on a user ID is meaningful overhead, you're either working on a very large system (which, obviously, happens, and is a legitimate concern), or something in your codebase is about to show up on TDWTF.

2019-10-28 Reply Admin

This kind of reminds me of a system developed by a DBA I used to work with. This system was a large number of computations to calculate metrics on programs.

When building it, the DBA told me he was going to organize the computations by program name since, "primary keys can change."

I told him that was fucking retarded. One, identities do not change and two, the administrative UI for this system allows changes on program name.

Sure enough, pretty much the first thing an admin did was decide they didn't like program 1 to named "xxx" any more, but "yyy", and all the computations quit working.

Also @Raj: the final sentence of your reply is a great point. Don't forget that best practices are just guidelines. Varying from them should not be done without some thought, but there certainly are times when it's valid to make a specific implementation work differently than "normal".

2019-10-28 Reply Admin

What's wrong with the VIN? "Vehicle Identification Number" sounds kind of like exactly what they want.

2019-10-28 Reply Admin

return 4;

I guess it's a good thing for The Developer that he was in a state where license plate numbers never changed. Here in Texas, they suddenly give you new plates every few years, because the reflective background has a limited rated life. Of course the plates have new numbers, because it's a pain in the ass to make a run of license plates with specific reused numbers, when they can just pull one from the pile of new plates already at the DMV office. And also the car always gets new plates when it is sold.

Mr. TA · 2019-10-28 Reply Admin

Usernames don't change often, but they do sometimes. What then, the database is broken? You might say, we'll cascade the update. Exactly how is that going to be accomplished in a scaled architecture like Google? Bad design.

It's rare that there's an entity which references a user and 90% of the time you need the username. High chance you also need first and last name and timezone, or whatever. So you have to join anyway. Now, joining by an integer is way faster than by string. String keys are bad design.

Months shouldn't be an entity in a database to begin with. Countries - yes maybe there's a system where all you need is the 3 letter country code and surrogate identifier creates overhead. But like I said before, the overhead is non existent because the database will quickly cache the table and the value of having a uniform database design is higher. Secondly, what if the business requirement changes and full name is needed? Then you're stuck with a slower string join to retrieve the required column. Again, natural key is bad design.

You say it's not about making things neat and comforting. Well, unwillingness to write a simple join to get the username or country code is exactly "neat and comforting", whereas understanding the concept of uniform design and architectural patterns is not.

I'm yet to hear an example where a natural key is preferable or even feasible.

2019-10-28 Reply Admin

All this talk about keys and no one mentions that the horrible code for(int i = 0; i < 999999; i++) { numbers.add(i + 1); } should be for(int i = 1; i <= 999999; i++) { numbers.add(i); }

Even assuming the compiler optimizes away the extra million "addition" operations, it still makes my skin crawl to look at it.

2019-10-28 Reply Admin

I used to work at a company whose main business was to combine data from the N+1 different systems used by hospitals to store information about the patients. One particularly difficult piece of data to join to that seen before, were laboratory answers. I believe they were stored using some running id number in the originating system, but the way our system received the data was this horrible HL7 protocol, which had no idea of those ids.

In HL7 messages, they were identified by an examination id, but that was in practice only used by humans to refer to a specific examination done at a specific day to a specific patient. The id was only four digits long, and rolled around several times a week. Each examination had some kind of a type, but even with that there were several collisions per year. In the end, the key we used to identify each examination was something like id + type + laboratory + patient's social security number, and even with that, there was a genuine fear of accidentally joining data from two different examinations for the same patient.

2019-10-28 Reply Admin

A Table whose purpose is keeping data concerning Document Versioning. The primary key is given by document id and an incremental number - the version. That's a natural key, would you really add a surrogate key for that?

Steve_The_Cynic · 2019-10-28 Reply Admin

Usernames don't change often, but they do sometimes.

The most common instance of this is where the username is of the form "firstname.lastname" (whatever "first" and "last" mean in this context), and the user in question is a woman who follows the slightly dated custom of using her spouse's last name after marriage. The former Mary Smith is going to want to be able to log in as "mary.jones" rather than "mary.smith" after she marries John Jones and changes to Mary Jones.

Not to mention nutjobs like a guy the late Mrs Cynic knew long and long ago. He had the questionable habit of going in front of a court (in Massachusetts) about every three months and changing his name legally, just because he liked doing it.

2019-10-28 Reply Admin

Wait, why does the same table have two kinds of records? If you are fudging your natural key like that, it's not even a natural key in the first place.

2019-10-28 Reply Admin

License plates change, but VIN numbers don't. Maybe a natural key using the VIN number would have been a better choice?

2019-10-28 Reply Admin

Falsehoods programmers believe about vehicles: ...

All vehicles have VINs.
VINs do not change. ...

2019-10-28 Reply Admin

Interesting story... I'm in the process of building a car from a kit. The kit has a Manufacturer Certificate of Origin, but no title. When I'm done, I get it inspected and apply for a title. As part of that application, I get to make up a VIN. They give me no format restrictions and they don't check for duplicates. I'm so going to Bobby Tables this one.

2019-10-28 Reply Admin

I think you just shot yourself in the foot there... Amazon does NOT use the SKUs for unique identifiers because they're not unique and can change, i.e. everything wrong with natural keys. Every product gets its own unique to Amazon surrogate key.

Furthermore, using username as a key is also a terrible idea. That's why it took Sony 10 years to be able to allow users to change their names on Playstation Network - because some dope doing the initial design decided to tie everything to the username instead of a user id, so changing that took a massive migration effort. Or one of my workplaces, where the usernames are first initial + lastname + number if needed (not my design) because some genius thought names never change. Well guess what, someone got married. IT had to go in and hack #$%! by hand, and they missed a few places. kaboom

2019-10-28 Reply Admin

I do love first initial + lastname, because there's always a Frank Ucker or something eventually. Bonus points if profanity filters in internal email trigger a disciplinary action.

Mr. TA · 2019-10-28 Reply Admin

Another huge reason why usernames are terrible keys is the sheer number of tables referencing users. Columns like insert user, update user, etc. number in the hundreds. But even audit columns aside, a lot of events and objects in a typical data driven system is centered around the user. Even with cascade on update, that username update transaction will take 30 minutes, as the database will run around scanning and updating many tables to. It's a disaster.

Watson · 2019-10-28 Reply Admin

That is a terrible way to generate a random six-digit number, as evidenced by the need to fix it up afterward. The developer should have realised the right way to do it would have been to build it up incrementally. Have the loop construct an array of a mere ten digits 0-9. Shuffle that array and pick from it a random digit, casting it to a string. That gives the first digit of the synthetic plate number. Do all that five more times and then catenate the six digit strings together.

2019-10-28 Reply Admin

That won't be random enough.

Mr. TA · 2019-10-28 Reply Admin

It's not really natural, is it? You have a database generated ID and a database incremented version, both surrogates. Even date time instead of version isn't natural in this context, because it's a universal measurement which in its essence is a float, and not something dictated by the whims of people. Maybe "natural key" should be called "anthropogenic key" ...

2019-10-28 Reply Admin

This right here is basically my job! I evaluate RNGs used in gambling applications, and this is what you see when the client has no idea what they're doing. There are so many layers of WTF that you don't know what to bring up that needs fixing.

Nevermind the million +1's - that's nothing, compared to the million .add()'s to an ArrayList whose capacity is never initialised.

The Collection.shuffle is absolutely superfluous, as they pick a random index next. That shuffle, by the way, requires a million random numbers - more costly than the previously mentioned operations (probably - depending on how easily the ArrayList grows).

All of that is unnecessary. If you want a random number in a given range, there are functions to do that directly as mentioned in the article.

Why is all this done? To be "random enough" - a misconception I see all the time. People don't understand what makes their process random, and what adds nothing.

Oldtaku · 2019-10-29 Reply Admin

I feel you. Though maybe in the end the worst bit here is that it doesn't even need to be random, yet huge amounts of effort are being wasted to make something badly random. Everything here adds nothing!

2019-10-29 Reply Admin

"I encoded it using a ROT13. And then, to make sure it was really secure, I encoded it in a ROT13 again."

jkshapiro · 2019-10-29 Reply Admin

Just wanted to say how much I appreciate the discussion here. I never really understood why people insisted on creating a whole extra column in every table just to create an arbitrary primary key, when there was a natural key available. What a waste of space! But now I get it and I'm grateful.

That's what I love about this site: yes it's snarky and fun but it's also educational.

2019-10-29 Reply Admin

I think it's more than the space - the space is just an excuse. There are people like you (and me at one time) who just haven't run into all the nastiness natural keys cause. When you do, well A Lesson Was Learned. Then there's people who don't realize that apparently safe and unique natural keys (like VINs and MACs) - aren't. Another Lesson Was Learned.

Then there are the natural key people who are just lazy in the bad way like Extreme Programming people. There's good lazy where you make an effort up front (write a quick utility maybe) to save tons of makework effort later, then there's bad lazy where you slap some crap together now and hope you're gone before it hits the fan. They've run into the problems before but they just don't want to be bothered with making another column per table and dealing with it in the code, so they'll haul out spurious arguments about stuff like speed to justify it. YAGNI! they cry. Well experience says you are gonna need it, so put it in up front when it's 1/10th the work - that's good lazy.

The_Bytemaster · 2019-10-29 Reply Admin

Depending on your DB, you may not even be wasting space with the extra column. If you don't even create a Primary Key in SQL server, for example, it creates one internally to itself to it can keep track of rows and just does not expose them to you. Adding an int column as a primary key to these tables gives you that extra column but does not actually take any additional storage.

The_Bytemaster · 2019-10-29 Reply Admin

Often times, speed can increase. The fastest join you can do, in SQL Server anyway, is on an INT column. Text joins slow down, especially if you go with the default of case insensitive collation.

Meanwhile, if you have the server generating your keys for you, you have some limitations. Going with unique identifiers can allow you to really speed up writes from your application in many cases.

The_Bytemaster · 2019-10-29 Reply Admin

Often times, speed can increase. The fastest join you can do, in SQL Server anyway, is on an INT column. Text joins slow down, especially if you go with the default of case insensitive collation.

Meanwhile, if you have the server generating your keys for you, you have some limitations. Going with unique identifiers can allow you to really speed up writes from your application in many cases.

2019-10-29 Reply Admin

https://en.wikipedia.org/wiki/Vehicle_identification_number says otherwise.

2019-10-29 Reply Admin

On the opposite end of the spectrum, however, there are developers who don't properly enforce relationships by their natural keys and end up defining everything with arbitrary-yet-always-incrementing integer values that have no relevance to uniquely identifying and associating entities, thereby resulting in a greater lack of data integrity. A proper mix of natural keys and surrogate keys is definitely an appropriate choice and one that people should seriously consider.

Good article.

2019-10-29 Reply Admin

I ran this on my computer and it's about 50,000 times slower than just using rand.nextInt(1000000)

Mr. TA · 2019-10-29 Reply Admin

You're wrong, unless you provide one example where natural key is better.

2019-10-30 Reply Admin

Many years ago, a researcher created a database using patient's medical record numbers as primary keys, which is a good natural key. It's also protected medical information. He couldn't show outside researchers his data until he completely rebuilt his database.

2019-10-30 Reply Admin

Where it is better? I don't believe that was the statement I made.

idzy · 2019-10-31 Reply Admin

Same thing for usernames in general

Yes, because nobody ever grows up and may want to change their username from "Dr. Doom" to something more sensible on a forum they've used for years and built a reputation on.

idzy · 2019-10-31 Reply Admin

A proper mix of natural keys and surrogate keys is definitely an appropriate choice and one that people should seriously consider.

In almost every case that a natural key can be genuinely justified as a valid record identifier, a value type would be equally appropriate. Otherwise, you're asking for trouble, unforeseen to you at the time you design the database, but will ultimately be the bane of someone in the future's existence.

This is the reason that in the state of Australia in which I reside, it is impossible to have your driver's license number changed in the event of identity fraud.

idzy · 2019-10-31 Reply Admin

A proper mix of natural keys and surrogate keys is definitely an appropriate choice and one that people should seriously consider.

In almost every case that a natural key can be genuinely justified as a valid record identifier, a value type would be equally appropriate. Otherwise, you're asking for trouble, unforeseen to you at the time you design the database, but will ultimately be the bane of someone in the future's existence.

This is the reason that in the state of Australia in which I reside, it is impossible to have your driver's license number changed in the event of identity fraud.

idzy · 2019-10-31 Reply Admin

A proper mix of natural keys and surrogate keys is definitely an appropriate choice and one that people should seriously consider.

In almost every case that a natural key can be genuinely justified as a valid record identifier, a value type would be equally appropriate. Otherwise, you're asking for trouble, unforeseen to you at the time you design the database, but will ultimately be the bane of someone in the future's existence.

This is the reason that in the state of Australia in which I reside, it is impossible to have your driver's license number changed in the event of identity fraud.

idzy · 2019-10-31 Reply Admin

Going with unique identifiers can allow you to really speed up writes from your application in many cases.

Doesn't using random GUIDs as the primary key (and by default the clustered index) result in loads of unnecessary page splits, leaving your database both fragmented and full of "spare" space, but also causes a whole lot of block-copy operations as the most probable thing to occur when you try to insert a random number into an ordered list of random numbers is that you'll be doing it in the middle of two other numbers?

To Be Random Enough

Leave a comment on “To Be Random Enough”