Admin
Because multiplying by three gives you that much extra space for the Unique Key, just in case you needed to fill those extra spaces. Wait, are we really getting an Int back? Huh...
Filed under: Well, at least there aren't any Overflow errors?
Admin
Wait, his suggestion was to replace a hash function with another hash function also known to have collisions?
The submitter was smart to remain anonymous if he's looking for a job...
Admin
I like the way Mongo generates UUIDs, and I reuse it in some of my applications for public-facing keys. So instead of users seeing:
They see
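A rough sketch of the generation scheme (assuming MongoDB's documented ObjectId layout; simplified Python for illustration, not Mongo's actual code):

```python
import itertools
import os
import struct
import time

# MongoDB-style ObjectId sketch: 4-byte big-endian timestamp,
# 5 random bytes fixed per process, 3-byte incrementing counter.
# The result is 12 bytes, conventionally shown as 24 hex characters.
_process_random = os.urandom(5)
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

def object_id() -> str:
    timestamp = struct.pack(">I", int(time.time()))            # 4 bytes
    counter = (next(_counter) % 0x1000000).to_bytes(3, "big")  # 3 bytes
    return (timestamp + _process_random + counter).hex()
```

Because the timestamp leads, the ids sort roughly by creation time, while the random and counter bytes keep them from leaking how many records exist.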
Admin
...go on? I think you accidentally submitted your post early.
Admin
Cthulhu snatched him out of the universe before he could do any further damage to its infrastructure.
Admin
Well you know the difference between a church bell.
The higher the louder!
Admin
So, I guess you could say the Lead Developer... [removes shades] ...made a hash of it?
Admin
What you mean is that the collisions should not be easy to calculate.
And really, WTF, this isn't hashing data supplied by random morons on the Internet who are attacking us. This is a data import function. We just need a unique->unique transformation.
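One such unique-to-unique transformation, sketched in Python (assuming value1 and value2 are 32-bit values):

```python
# Pack two 32-bit values into one 64-bit key. Distinct input pairs
# can never collide, because the two halves never overlap.
def unique_key(value1: int, value2: int) -> int:
    assert 0 <= value1 < 2**32 and 0 <= value2 < 2**32
    return (value1 << 32) | value2
```

In C# the equivalent would be roughly `((long)value1 << 32) | (uint)value2`.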
That is, it doesn't matter, so long as you don't do it like this. (A 64-bit value calculated as
(value1 << 32) + value2
would suffice.)
Admin
Admin
That depends what you're doing with the hashcode. If you're using it as a proxy for identity, yes, you should not be able to generate two things with the same code in the same system. If you've got an actual equality testing step later on, you can be a lot more lax.
Admin
Admin
https://www.youtube.com/watch?v=1JyZ_U3Qg58&t=4
Admin
Actually, it can work on data that is larger so long as the data is not information-dense. The main problem is that such perfect hashing techniques are very sensitive to the data, and are in reality a PITA to set up. People use other schemes because they cope with real-world diversity better.
Admin
The identity function is pretty good for this purpose...
Admin
:headdesk:
Admin
Assuming GetHashCode() is uniformly distributed, the sum of two hashes (multiplying by 3 only permutes the possible values) will still be uniform since it's taken mod 2^32 (roll two dice and calculate the distribution of the sum mod 6), and the expected number of unique values in 12,000,000 random choices will be
2^32 - 2^32*(1-1/2^32)^12,000,000 = 11,983,251
So I'm gonna have to call bullshit on a 50% collision rate.
https://en.wikipedia.org/wiki/Birthday_problem#Collision_counting
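The estimate is easy to reproduce (Python, using the same formula):

```python
# Expected number of distinct values after k uniform draws from a
# space of size n is n - n * (1 - 1/n) ** k.
n, k = 2**32, 12_000_000
expected_distinct = n - n * (1 - 1 / n) ** k
expected_collisions = k - expected_distinct
# expected_collisions is roughly 17,000 - about 0.14%, nowhere near 50%.
```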
Admin
Many posters have said it implicitly, but maybe it should be made explicit: a hash function does not, by design, return unique values; that is not its purpose, and unless you design your own perfect hash function, you should never, ever assume that a library hash function produces distinct values when applied to a set of distinct inputs. The requirement on a hash function is that if the hashed values are different, the original values are different; not the converse.
Admin
Do you know any hash function that does not?[1] TRWTF is this in the article:
[1] are you lead-est of the lead?
Admin
Admin
They left in the WTF of multiplying one value by 3, but edited out the WTF of writing a homemade hash function with a dramatically restricted range?
Admin
OK, could someone explain something to me? I've seen this kind of WTF more than once in my professional career. What is it with this fascination with creating random/random-looking surrogate keys for data without an existing identifier? (Yeah, I know the hashcode is not random - but it's like they're treating it as if it were random and unique.)
Is there something more efficient about it than a sequence / auto-incrementing column?
I get the benefit of a UUID (replication across systems, etc.)
As near as I can tell all it does for me is anonymize the order in which the data was imported and maybe reduce a round trip to the DB to get the next sequence number (but still require you to check for a collision on your randomly generated ID).
Admin
Since Maciej is creeping closer and closer to resembling Charles Manson, I suggest we send him over to 'talk' to the Lead Developers.
j/k ;-P
Admin
Apparently, for ints in C#, GetHashCode() is no more than an identity function. The String hash code for value2 should be rather well distributed. I wonder what the integer overflow truncation will do to the distribution of the "UniqueKey", and further, whether it matters if value1 were ever near the integer limit, so that 3 * value1 would sometimes overflow.
Admin
In C# (and Java), integer overflow doesn't result in truncation, but in modulo arithmetic (if you're using a checked context in C#, it results in an overflow exception instead). Since 3 has no common factor with 2^32, the multiplication merely shuffles the values. On the other hand, multiplying by an even number would reduce the hash space (by the gcd of the factor and the modulo base). For instance, if you roll a d6 and multiply the result by a fixed factor and take that modulo 6, you get:
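The die experiment is quick to run (Python sketch; the same argument applies with modulus 2^32):

```python
# A factor coprime to the modulus merely permutes the values;
# a factor sharing a divisor with the modulus collapses the space
# to modulus / gcd(factor, modulus) values.
def image(factor, modulus=6):
    return sorted({(factor * value) % modulus for value in range(modulus)})

# gcd(5, 6) = 1: all six residues survive. gcd(4, 6) = 2: only three do.
```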
Admin
It's almost certainly less efficient, but it has one minor advantage when it comes to (poor) security. With something like an auto-incrementing integer, a person who gets a valid value (e.g. creates a new topic in a forum) can start guessing other nearby values that will most likely also be valid. If no other authentication is needed, then that person can view information that they possibly shouldn't be able to see.
Admin
That's exactly what happened here: https://what.thedailywtf.com/t/the-bigpug-moon/6911
Admin
Very poor.
Being able to guess the next user ID should not amount to a security problem, since you're not supposed to be allowed to view the information even with a direct link. Otherwise, that's just begging for the most basic social engineering attack - almost no one considers the user ID sensitive information, so ask and ye shall receive.
An auth token of any sort should not be keying the record either, but rather work in conjunction with the key. Unless you want the "sorry, another user already chose this password" security schema.
Admin
Admin
I agree you're correct in principle, but MD5 hash collisions are rather unlikely (not impossible, admittedly) and usually don't happen unless you are trying to produce a collision. So it's no good for a security function on short inputs, like a MAC, but for a database record where the combined attributes run 100+ bytes, it's likely just fine. MD5 is also cheap to calculate, and happens to match GUIDs in byte size.
There may be good reasons to use a hash value as a unique key as well, at least in the context of an ETL process. If you are getting large amounts of data (the submitter is getting millions of rows) that is NOT normalized in from some other system, for example, using a hash makes sense. You can bulk load the data, normalizing on the fly, without having to do all kinds of lookups.
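A sketch of that idea (Python; `surrogate_key` is a hypothetical helper name):

```python
import hashlib

# Derive a deterministic 16-byte surrogate key from a row's natural
# attributes. 16 bytes is the same size as a GUID, so it fits an
# existing uniqueidentifier-style column.
def surrogate_key(*attributes: str) -> bytes:
    # A separator avoids ("ab", "c") and ("a", "bc") hashing identically.
    joined = "\x1f".join(attributes)
    return hashlib.md5(joined.encode("utf-8")).digest()
```

Because the key is a pure function of the attributes, every stage of the pipeline can recompute it independently - no lookups against the target database during the load.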
Admin
Here's a nickel, kid, buy yourself an IDENTITY...
I guess a strong enough hash would be fine, but assuming a scenario where every record needs a unique ID (and not one where two records with identical key tuples should get the same ID - not the case in the submission AFAIK), why bother hashing?
Admin
It's very handy for replicated systems and n-tier applications. In both cases, it allows a system to create a new identity value without having a central authority.
In the case of replicated systems, this allows autonomy - a replica can function when communications are cut off.
In the case of n-tier applications, the client can make a set of related objects and persist them as a batch without having to go to the database to get an id. You avoid the catch-22 of "I can't save it until it's valid, but it won't be valid until it has a related object - but I can't create the relationship until I have an id, which I get as a return value from saving it".
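A minimal sketch of that pattern (Python; the object names are illustrative):

```python
import uuid

# The client assigns ids itself, so a parent and its children can be
# persisted in one batch with the relationship already in place -
# no round trip to the database to fetch an identity value first.
order_id = uuid.uuid4()
lines = [
    {"id": uuid.uuid4(), "order_id": order_id, "qty": qty}
    for qty in (1, 2, 3)
]
assert all(line["order_id"] == order_id for line in lines)
```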
Admin
Yah, I get that. Those are typically situations where you would naturally use a UUID (or something analogous to one).
I was more suspecting something may emerge when dealing with massive tables / "big data" situation that I wasn't seeing.
Admin
We have a process where I work that uses the microsecond from a DB2 CURRENT TIMESTAMP to create a "random" identifier for a document to be printed. Turns out there are some problems with this scheme:
Today, the system retains documents only temporarily--usually until printed or until evening (there's a cleanup process). Yet it is surprising how documents that are needed only transiently can build up in the table to the point where collisions start.
Back then, it was made even worse by the fact that, in the original conception, documents were supposed to be kept in the tables for days or weeks for online review. That lasted an amazingly short time--maybe 3 months--at which point it was realized the whole thing was going to become unusable due to collisions and lack of free identifiers.
It's un-fixable without manual mods and rebuilding hundreds of programs, because the code lives in a header file and the 6-digit identifiers are passed around everywhere.
Admin
That won't work.
If I have
userinfo1,userinfo2,userinfo3,productinfo1,productinfo2, and I want every row of product info associated with a user_key in one table, and only unique userinfo1,userinfo2,userinfo3 tuples in my users table, I can't import that in bulk using identity. I would have to do lookups to get the user_key. This is not an uncommon ETL problem, and this is a common solution.
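That one-pass load can be sketched like this (Python; MD5 stands in for whatever hash the ETL actually uses):

```python
import hashlib

# Flat, denormalized input rows: user columns plus product columns.
rows = [
    ("alice", "a@x.com", "NY", "widget", 3),
    ("bob",   "b@x.com", "LA", "gadget", 1),
    ("alice", "a@x.com", "NY", "gizmo",  7),
]

def user_key(u1, u2, u3):
    # Deterministic key derived from the user columns themselves,
    # so no round trip to the database is needed during the load.
    return hashlib.md5(f"{u1}\x1f{u2}\x1f{u3}".encode()).hexdigest()

users, products = {}, []
for u1, u2, u3, p1, p2 in rows:
    key = user_key(u1, u2, u3)
    users[key] = (u1, u2, u3)        # de-duplicates users in one pass
    products.append((key, p1, p2))   # every product row carries its user_key
```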
Admin
As a matter of fact, the overflow truncation rescues its randomness in a way. With unlimited integer size, adding together two uniform random variables taking integer values between 0 and N-1 would give a random variable with a triangular distribution and values between 0 and 2*N-2, but reducing the sum mod N transforms it back to a uniform distribution with values between 0 and N-1.
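Easy to verify with two dice (Python):

```python
from collections import Counter
from itertools import product

# Raw sum of two fair dice: triangular distribution.
# The same sum taken mod 6: exactly uniform again.
raw = Counter(a + b for a, b in product(range(6), repeat=2))
wrapped = Counter((a + b) % 6 for a, b in product(range(6), repeat=2))
```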
Admin
Admin
Also, having no central place to check means one can create UUIDs on different threads/machines without coordinating between them. Of course, at the database layer the hash collision should be caught, and dealt with in the application by e.g. generating UUID values again. There are UNIQUE constraints or similar clauses for this purpose. However, since this article is about migrating a system, I think an auto-incrementing column should have been sufficient; value1 and value2 could be UUIDs, why not (just after making sure the database enforces uniqueness and isn't silently ignoring it :wtf: ).
Admin
It is of course possible that the figures in the article may have been slightly exaggerated, but that hardly ever happens so we shouldn't look to it as the most likely cause.
Admin
For those cases I'd usually just go with something like what Twitter did, which was basically to give each application instance a unique ID and generate database IDs with that in it - effectively preventing any other application from generating the same ID, making it unique.
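A sketch of that scheme (Python; the bit widths mirror Snowflake's published layout - 41-bit millisecond timestamp, 10-bit instance id, 12-bit per-millisecond sequence - but the class itself is illustrative):

```python
import threading
import time

class IdGenerator:
    """Snowflake-style id generator: ids are unique across instances
    because each instance id appears in every id it produces."""

    def __init__(self, instance_id: int):
        assert 0 <= instance_id < 1024          # fits in 10 bits
        self.instance_id = instance_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ms = time.time_ns() // 1_000_000
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) % 4096  # 12 bits
            else:
                self.sequence = 0
                self.last_ms = ms
            return (ms << 22) | (self.instance_id << 12) | self.sequence
```

Ids also sort roughly by creation time, since the timestamp occupies the high bits - something a hash-based key gives up entirely.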