• (disco)

    Because multiplying by three gives you that much extra space for the Unique Key, just in case you needed to fill those extra spaces. Wait, are we really getting an Int back? Huh...


    Filed under: Well, at least there aren't any Overflow errors?

  • (disco)

    Wait, his suggestion was to replace a hash function with another hash function also known to have collisions?

    The submitter was smart to remain anonymous if he's looking for a job...

  • (disco)

    I like the way Mongo generates UUIDs, and I reuse it in some of my applications for public-facing keys. So instead of users seeing:

    /products/1
    

    They see

    /products/hrksin16hdn75hdnsk76
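
    For the curious, a Mongo ObjectId is just 4 bytes of timestamp, 5 random bytes per process, and a 3-byte counter. A minimal sketch of that layout (purely illustrative, not Mongo's driver code):

    using System;
    using System.Security.Cryptography;
    using System.Threading;

    class ObjectIdSketch
    {
        // 5 random bytes, fixed once per process, as in the ObjectId spec.
        private static readonly byte[] ProcessRandom = RandomNumberGenerator.GetBytes(5);
        private static int _counter = RandomNumberGenerator.GetInt32(0, 1 << 24);

        public static string Next()
        {
            Span<byte> id = stackalloc byte[12];
            uint ts = (uint)DateTimeOffset.UtcNow.ToUnixTimeSeconds();
            id[0] = (byte)(ts >> 24); id[1] = (byte)(ts >> 16);
            id[2] = (byte)(ts >> 8);  id[3] = (byte)ts;
            ProcessRandom.CopyTo(id.Slice(4, 5));
            int c = Interlocked.Increment(ref _counter) & 0xFFFFFF; // 3-byte counter
            id[9] = (byte)(c >> 16); id[10] = (byte)(c >> 8); id[11] = (byte)c;
            return Convert.ToHexString(id).ToLowerInvariant();
        }
    }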
    
  • (disco) in reply to Eldelshell
    Eldelshell:
    So instead of users seeing:

    ...go on? I think you accidentally submitted your post early.

  • (disco) in reply to LB_

    Cthulhu snatched him out of the universe before he could do any further damage to its infrastructure.

  • (disco) in reply to LB_

    Well, you know the difference between a church bell.

    The higher the louder!

  • (disco)

    So, I guess you could say the Lead Developer... [removes shades] ...made a hash of it?

  • (disco) in reply to JBert
    JBert:
    Wait, his suggestion was to replace a hash function with another hash function also known to have collisions?

    The submitter was smart to remain anonymous if he's looking for a job...

    All hash functions are known to have collisions, by definition.

    What you mean is that the collisions should not be easy to calculate.

    And really, WTF, this isn't hashing data supplied by random morons on the Internet who are attacking us. This is a data import function. We just need a unique->unique transformation.

    That is, it doesn't matter, so long as you don't do it like this. (A 64-bit value calculated as (value1 << 32) + value2 would suffice.)
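
    A minimal sketch of that aside, assuming both key parts fit in 32 bits:

    using System;

    long MakeUniqueKey(int value1, int value2) =>
        // Cast through uint so sign extension can't smear value2 into value1's half.
        ((long)(uint)value1 << 32) | (uint)value2;

    Console.WriteLine(MakeUniqueKey(1, 2)); // 4294967298 - distinct pairs give distinct keys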

  • (disco) in reply to ka1axy
    ka1axy:
    [removes shades]
    [image]
  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    What you mean is that the collisions should not be easy to calculate.

    That depends what you're doing with the hashcode. If you're using it as a proxy for identity, yes, you should not be able to generate two things with the same code in the same system. If you've got an actual equality testing step later on, you can be a lot more lax.

  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    All hash functions are known to have collisions, by definition.
    It's possible to have a hash function that has no collisions; however, it only works on data no larger than the hash itself, e.g. bytes, shorts, and ints.
  • (disco) in reply to Luhmann

    https://www.youtube.com/watch?v=1JyZ_U3Qg58&t=4

  • (disco) in reply to Salamander
    Salamander:
    however it only works on data smaller than the hash itself, eg bytes, shorts, and ints

    Actually, it can work on data that is larger so long as the data is not information-dense. The main problem is that such perfect hashing techniques are very sensitive to the data, and are in reality a PITA to set up. People use other schemes because they cope with real-world diversity better.
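
    To illustrate that sensitivity, here's a toy multiply-shift search (all keys hypothetical): brute-force a multiplier that happens to map this exact set into 8 slots without collision. Change one key and the search usually has to start over.

    using System;
    using System.Linq;

    class PerfectHashDemo
    {
        static void Main()
        {
            uint[] keys = { 17, 42, 99, 256, 1024, 31337 };
            for (uint a = 1; a < uint.MaxValue; a += 2) // odd multipliers only
            {
                var slots = keys.Select(k => (a * k) >> 29); // top 3 bits -> 8 slots
                if (slots.Distinct().Count() == keys.Length)
                {
                    Console.WriteLine($"multiplier {a} is collision-free for this key set");
                    break;
                }
            }
        }
    }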

  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    We just need a unique->unique transformation.

    The identity function is pretty good for this purpose...

  • (disco)
    Article:
    using a pair of keys that uniquely identified a record
    "hashCode". "uniquely identify". You keep using these words...
    Submitter:
    suggested using an MD5 hash to generate unique keys instead
    Aaarg, so much WTF, on so many levels.

    :headdesk:

  • (disco)

    Trivial calculations left as an exercise to the reader.

    Assuming GetHashCode() is uniformly distributed, the sum of two hashes (multiplying by 3 only permutes the possible values) will still be uniform since it's taken mod 2^32 (roll two dice and calculate the distribution of the sum mod 6), and the expected number of unique values in 12,000,000 random choices will be

    2^32 - 2^32*(1-1/2^32)^12,000,000 = 11,983,251

    So I'm gonna have to call bullshit on a 50% collision rate.

    https://en.wikipedia.org/wiki/Birthday_problem#Collision_counting
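
    For anyone who wants to check the arithmetic, a quick sketch of the same expectation:

    using System;

    class BirthdayCheck
    {
        static void Main()
        {
            double N = Math.Pow(2, 32); // hash space
            double n = 12_000_000;      // records inserted
            double expectedDistinct = N * (1 - Math.Pow(1 - 1 / N, n));
            Console.WriteLine($"{expectedDistinct:N0} distinct");       // ~11,983,251
            Console.WriteLine($"{n - expectedDistinct:N0} collisions"); // ~16,700 - nowhere near 50%
        }
    }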

  • (disco)

    Many posters have said it implicitly, but maybe it should be made explicit? A hash function does not, by design, return unique values; that is not its purpose, and unless you design your own perfect hash function, you should never, ever assume that a library hash function produces distinct values when applied to a set of distinct values. The requirement on a hash function is that if the hashed values are different, the original values are different; not the contrary.

  • (disco) in reply to JBert
    JBert:
    Wait, his suggestion was to replace a hash function with another hash function also known to have collisions?

    Do you know any hash function that does not?[1] TRWTF is this in the article:

    Together with the insert library silently dropping entries with the same key

    [1] are you lead-est of the lead?

  • (disco) in reply to Forgotmylogin1
    Forgotmylogin1:
    So I'm gonna have to call bullshit on a 50% collision rate.
    Or the hash function had a much smaller space. In the limit, drawing n elements from a hash space of size n leaves around 63% of them unique, i.e. a 37% collision rate. Even with a hash space twice as large as the input, the collision rate is still about 21%.
  • (disco) in reply to Hanzo

    They left in the WTF of multiplying one value by 3, but edited out the WTF of writing a homemade hash function with a dramatically restricted range?

  • (disco)

    OK, could someone explain something to me? I've seen this kind of WTF more than once in my professional career. What is it with this fascination with creating random/random-looking surrogate keys for data without an existing identifier? (yah, I know the hashcode is not random - but it is like they're treating it as if it is random and unique).

    Is there something more efficient about it than a sequence / auto-incrementing column?

    I get the benefit of a UUID (replication across systems, etc.)

    As near as I can tell, all it does for me is anonymize the order in which the data was imported and maybe reduce a round trip to the DB to get the next sequence number (but it still requires you to check for a collision on your randomly generated ID).

  • (disco)

    Since Maciej is creeping closer and closer to resembling Charles Manson, I suggest we send him over to 'talk' to the Lead Developers.

    j/k ;-P

  • (disco)

    Apparently, for ints in C#, GetHashCode() is no more than an identity function. The String hash code for value2 should be rather well-distributed.

    I wonder what the integer overflow-truncation will do to the distribution of the "UniqueKey". And further, whether it matters if value1 ever were near the integer limit, so that 3 * value1 would sometimes overflow.
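
    A quick check of the identity claim (which does hold in .NET):

    using System;

    int value1 = 12345;
    Console.WriteLine(value1.GetHashCode() == value1); // True: an Int32 hashes to itself
    Console.WriteLine((-7).GetHashCode());             // -7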

  • (disco) in reply to gleemonk

    In C# (and Java), integer overflow doesn't result in truncation, but in modulo arithmetic (if you're using a checked context in C#, it results in an overflow exception instead). Since 3 has no common factor with 2^32, the multiplication merely shuffles the values. On the other hand, multiplying by an even number would reduce the hash space (by the gcd of the factor and the modulo base). For instance, if you roll a d6 and multiply the result by a fixed factor and take that modulo 6, you get:

    • 1 -> 1 2 3 4 5 6
    • 2 -> 2 4 6 2 4 6
    • 3 -> 3 6 3 6 3 6
    • 4 -> 4 2 6 4 2 6
    • 5 -> 5 4 3 2 1 6
    • 6 -> 6 6 6 6 6 6
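
    A minimal sketch of the same effect at full scale, shrunk to mod 2^8 so every value can be enumerated (the modulus is illustrative):

    using System;
    using System.Linq;

    class OddFactorDemo
    {
        static void Main()
        {
            const int mod = 256; // stand-in for 2^32
            int Reachable(int factor) =>
                Enumerable.Range(0, mod).Select(v => (v * factor) % mod).Distinct().Count();

            Console.WriteLine(Reachable(3)); // 256: odd factor is coprime to 2^8, a pure shuffle
            Console.WriteLine(Reachable(2)); // 128: even factor halves the hash space
            Console.WriteLine(Reachable(4)); // 64:  gcd(4, 256) = 4, a quarter remains
        }
    }
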
  • (disco) in reply to CodeSlave
    CodeSlave:
    OK, could someone explain something to me? I've seen this kind of WTF more than once in my professional career. What is it with this fascination with creating random/random-looking surrogate keys for data without an existing identifier? (yah, I know the hashcode is not random - but it is like they're treating it as if it is random and unique).

    Is there something more efficient about it than a sequence / auto-incrementing column?

    It's almost certainly less efficient, but it has one minor advantage when it comes to (poor) security. With something like an auto-incrementing integer, a person who gets a valid value (e.g. creates a new topic in a forum) can start guessing other nearby values that will most likely also be valid. If no other authentication is needed, then that person can view information that they possibly shouldn't be able to see.

  • (disco) in reply to Dragnslcr
    Dragnslcr:
    when it comes to (poor) security. With something like an auto-incrementing integer, a person who gets a valid value (e.g. creates a new topic in a forum) can start guessing other nearby values that will most likely also be valid. If no other authentication is needed, then that person can view information that they possibly shouldn't be able to see.

    That's exactly what happened here: https://what.thedailywtf.com/t/the-bigpug-moon/6911

  • (disco) in reply to Dragnslcr
    Dragnslcr:
    (poor) security

    Very poor.

    Being able to guess the next user ID should not amount to a security problem, since you're not supposed to be allowed to view the information even with a direct link. Otherwise, that's just begging for the most basic social engineering attack - almost no one considers their user ID sensitive information, so ask and ye shall receive.

    An auth token of any sort should not be keying the record either, but rather work in conjunction with the key. Unless you want the "sorry, another user already chose this password" security scheme.

  • (disco) in reply to Maciejasjmj
    Maciejasjmj:
    so ask and ye shall receive.
    My current userid on system 49g429aa8px is 21490 and the password (hashed and salted) is c5f861a48e4bf271d678f56aaecfab526e79f01e.
  • (disco) in reply to JBert

    I agree you're correct in principle, but MD5 hash collisions are rather unlikely (not impossible, admittedly) and usually don't happen unless you are deliberately trying to produce one. So it's no good for a security function on short inputs, like a MAC, but for a database record where the combined attributes run 100+ bytes, it's likely just fine. MD5 is also cheap to calculate and happens to match GUIDs in byte size.

    There may be good reasons to use a hash value as a unique key as well, at least in the context of an ETL process. If you are getting large amounts of data that is NOT normalized in from some other system (the submitter is getting millions of rows), using a hash makes sense, as sketched below. You can bulk load the data, normalizing on the fly, without having to do all kinds of lookups.
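
    A minimal sketch of that idea, with hypothetical column names:

    using System;
    using System.Security.Cryptography;
    using System.Text;

    static class HashKey
    {
        // Deterministic 128-bit key from a record's natural attributes; the same
        // input tuple always yields the same key, with no database round trip.
        public static Guid KeyFor(string userinfo1, string userinfo2, string userinfo3)
        {
            using var md5 = MD5.Create();
            byte[] digest = md5.ComputeHash(
                Encoding.UTF8.GetBytes($"{userinfo1}\u001F{userinfo2}\u001F{userinfo3}"));
            // The separator keeps ("ab", "c") and ("a", "bc") from colliding.
            return new Guid(digest); // MD5 is 16 bytes - exactly GUID-sized
        }
    }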

  • (disco) in reply to geoff
    geoff:
    using a hash makes sense

    Here's a nickel, kid, buy yourself an IDENTITY...

    I guess a strong enough hash would be fine, but assuming a scenario where every record needs a unique ID (and not one where two records with identical key tuples should get the same ID - not the case in the submission AFAIK), why bother hashing?

  • (disco) in reply to CodeSlave
    CodeSlave:
    OK, could someone explain something to me? I've seen this kind of WTF more than once in my professional career. What is it with this fascination with creating random/random-looking surrogate keys for data without an existing identifier? (yah, I know the hashcode is not random - but it is like they're treating it as if it is random and unique).

    Is there something more efficient about it than a sequence / auto-incrementing column?

    It's very handy for replicated systems and n-tier applications. In both cases, it allows a system to create a new identity value without having a central authority.

    In the case of replicated systems, this allows autonomy - a replica can function when communications are cut off.

    In the case of n-tier applications, the client can make a set of related objects and persist them as a batch without having to go to the database to get an id. You avoid the catch-22 of "I can't save it until it's valid, but it won't be valid until it has a related object - but I can't create the relationship until I have an id, which I get as a return value from saving it".
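
    A minimal sketch of that batch pattern, with hypothetical types:

    using System;

    record Order(Guid Id);
    record OrderLine(Guid Id, Guid OrderId);

    class Demo
    {
        static void Main()
        {
            // The client mints both ids up front, so the relationship can be
            // wired before anything is saved - no catch-22.
            var order = new Order(Guid.NewGuid());
            var line = new OrderLine(Guid.NewGuid(), order.Id);
            Console.WriteLine(line.OrderId == order.Id); // True
        }
    }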

  • (disco) in reply to Jaime

    Yah, I get that. Those are typically situations where you would naturally use a UUID (or something analogous to one).

    I was more suspecting something may emerge when dealing with massive tables / "big data" situation that I wasn't seeing.

  • (disco)

    We have a process where I work that uses the microsecond from a DB2 CURRENT TIMESTAMP to create a "random" identifier for a document to be printed. Turns out there are some problems with this scheme:

    • There are only 1,000,000 unique microsecond values in a DB2 timestamp (:wtf: now who would have guessed...)
    • Turns out that when you have a mere 10,000 documents in the table, collisions are surprisingly likely (adding the next 10,000 is likely to incur at least 150 collisions)
    • No provision was made for the possibility of a collision, such as search-next-available or try-again, so a collision translates to report failure and loss of user-requested work
    • Made worse by the reality that getting the microsecond from a CURRENT TIMESTAMP turns out to be a less than desirable way to get a "random" number (:wtf: non-random distribution, now who would have guessed...)

    Today, the system retains documents only temporarily--usually until printed or until evening (there's a cleanup process). Yet it is surprising how documents that are needed only transiently can build up in the table to the point where collisions start occurring.

    Back when, it was made even worse by the fact that, in the original conception, documents were supposed to be kept in the tables for days or weeks for online review. That lasted an amazingly short period of time--maybe 3 months--at which point it was realized the whole thing was going to become unusable due to collisions/lack of free identifiers.

    Un-fixable without manual mods and rebuilds for hundreds of programs, due to the code being in a header file and the fact that the 6-digit identifiers are passed around everywhere.
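
    For what it's worth, the 150-collision estimate above checks out (a rough sketch, treating the microsecond as uniform, which it isn't quite):

    using System;

    class CollisionEstimate
    {
        static void Main()
        {
            double space = 1_000_000, existing = 10_000, inserts = 10_000;

            double hitExisting = inserts * existing / space; // ~100 expected
            double distinctAmongNew = space * (1 - Math.Pow(1 - 1 / space, inserts));
            double hitEachOther = inserts - distinctAmongNew; // ~50 expected

            Console.WriteLine($"{hitExisting + hitEachOther:N0} expected collisions"); // ~150
        }
    }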

  • (disco) in reply to Maciejasjmj

    That won't work.

    If I have

    rows of userinfo1, userinfo2, userinfo3, productinfo1, productinfo2, and I want every row of product info associated to a user_key in one table, with only unique (userinfo1, userinfo2, userinfo3) tuples in my users table, then I can't import that in bulk using IDENTITY. I would have to do lookups to get the user_key. This is not an uncommon ETL problem and this is a common solution (see the sketch below).
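
    A minimal sketch of that flow, with hypothetical names (the MD5-as-GUID key is the same trick as above):

    using System;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    record Row(string U1, string U2, string U3, string P1, string P2);

    class EtlSketch
    {
        static Guid KeyFor(string a, string b, string c)
        {
            using var md5 = MD5.Create();
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes($"{a}\u001F{b}\u001F{c}")));
        }

        static void Main()
        {
            var rows = new[]
            {
                new Row("alice", "x", "y", "widget", "3"),
                new Row("alice", "x", "y", "gadget", "5"),
                new Row("bob",   "q", "r", "widget", "1"),
            };

            // Users table: one row per distinct user tuple, keyed by its hash.
            var users = rows.Select(r => (Key: KeyFor(r.U1, r.U2, r.U3), r.U1, r.U2, r.U3))
                            .DistinctBy(u => u.Key);

            // Products table: every row carries the same hash as its foreign key,
            // computed locally - no lookups against the users table needed.
            var products = rows.Select(r => (UserKey: KeyFor(r.U1, r.U2, r.U3), r.P1, r.P2));

            Console.WriteLine($"{users.Count()} users, {products.Count()} product rows"); // 2 users, 3 product rows
        }
    }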

  • (disco) in reply to gleemonk
    gleemonk:
    I wonder what the integer overflow-truncation will do to the distribution of the "UniqueKey". And further, whether it matters if value1 ever were near the integer limit, so that 3 * value1 would sometimes overflow.

    As a matter of fact, the overflow truncation rescues its randomness in a way. With unlimited integer size, adding together two uniform random variables taking integer values between 0 and N-1 would give a random variable with a triangular distribution and values between 0 and 2*N-2, but reducing the sum mod N transforms it back to a uniform distribution with values between 0 and N-1.
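
    The two-dice version of that claim, checked exhaustively:

    using System;

    class ModSumCheck
    {
        static void Main()
        {
            var counts = new int[6];
            for (int a = 1; a <= 6; a++)
                for (int b = 1; b <= 6; b++)
                    counts[(a + b) % 6]++; // sum of two uniform dice, reduced mod 6

            Console.WriteLine(string.Join(" ", counts)); // 6 6 6 6 6 6 - uniform again
        }
    }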

  • (disco) in reply to geoff
    geoff:
    not an uncommon ... is a common
    Ooh, clause-separated double-negatives!
  • (disco) in reply to Jaime

    Also, having no central place to check means one can create UUIDs on different threads/machines without coordinating between them. Of course, at the database layer any collision should be caught, and dealt with in the application by e.g. generating the UUID again. There are UNIQUE constraints or similar clauses for this purpose.

    However, since this article is about migrating a system, I think an auto-incrementing column should have been sufficient; value1 and value2 could be UUIDs, why not (just after making sure the database enforces uniqueness instead of silently ignoring it :wtf: ).

  • (disco) in reply to Forgotmylogin1
    Forgotmylogin1:
    So I'm gonna have to call bullshit on a 50% collision rate.
    That depends on how many existing records were already in the table (as well as the hash space as mentioned by @Hanzo). You're not necessarily just looking for collisions among the new records.

    It is of course possible that the figures in the article may have been slightly exaggerated, but that hardly ever happens so we shouldn't look to it as the most likely cause.

  • (disco) in reply to Jaime
    Jaime:
    In the case of n-tier applications, the client can make a set of related objects and persist them as a batch without having to go to the database to get an id.

    For those cases I'd usually just go with something like what Twitter did, which was basically to give each application instance a unique ID and generate database IDs with that in it - effectively preventing any other application from generating the same ID, making it unique.
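
    A minimal sketch of that scheme (bit widths illustrative, not Twitter's exact Snowflake layout):

    using System;

    class SnowflakeSketch
    {
        private readonly long _instanceId; // unique per application instance
        private long _sequence;
        private long _lastMs;

        public SnowflakeSketch(long instanceId) => _instanceId = instanceId & 0x3FF; // 10 bits

        // Not thread-safe; a real generator would lock or use Interlocked.
        public long Next()
        {
            long ms = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
            _sequence = (ms == _lastMs) ? (_sequence + 1) & 0xFFF : 0; // 12-bit counter per ms
            _lastMs = ms;
            return (ms << 22) | (_instanceId << 12) | _sequence;
        }

        static void Main()
        {
            var gen = new SnowflakeSketch(instanceId: 7);
            Console.WriteLine(gen.Next()); // time-ordered, instance-tagged, no coordination
        }
    }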
