• (cs) in reply to Shiva
    Shiva:
    Damn it you guys, I expect the trolls round here to be the typical "your not too smart" nonsense. OK, well done, well done.
    Your not too smart, are you?

    There, are we happy now?

  • F (unregistered) in reply to TheSHEEEP
    TheSHEEEP:
    Ehrm...

    I feel kinda stupid. Can anyone tell me how that "hash" helped reduce duplicate entries? Because I really don't get how it could do that.

    "Management was satisfied with the reduction ... " should not be taken as meaning that there actually was a reduction.

  • Anonymous (unregistered) in reply to F
    F:
    TheSHEEEP:
    Ehrm...

    I feel kinda stupid. Can anyone tell me how that "hash" helped reduce duplicate entries? Because I really don't get how it could do that.

    "Management was satisfied with the reduction ... " should not be taken as meaning that there actually was a reduction.

    Exactly. The fake hash (basically just a random number) would make it seem like there were fewer duplicate entries. It wouldn't be true, but it would certainly appear that way.
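
    For the record, a minimal sketch of that sort of input-ignoring "hash". The chunk count, the 9999999999999 bound, and the 48-character width are guesses pieced together from later comments, not the article's actual code:

    // A "hash" that never looks at its input.
    function createHash(record) {
      var hash = "";
      for (var i = 0; i < 3; i++) {
        var chunk = String(Math.floor(Math.random() * 9999999999999));
        while (chunk.length < 16) {  // zero-pad each chunk to 16 digits
          chunk = "0" + chunk;
        }
        hash += chunk;
      }
      return hash;  // same record in, different "hash" out, every time
    }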

  • anon (unregistered) in reply to frits

    well you'd obviously need a UUID

    that is, a Unique Universe IDentifier

  • (cs) in reply to Shiva

    Actually, it's (nTrolled % 1) times.

  • drusi (unregistered) in reply to TheSHEEEP
    TheSHEEEP:
    Ehrm...

    I feel kinda stupid. Can anyone tell me how that "hash" helped reduce duplicate entries? Because I really don't get how it could do that.

    Broken clock effect. Every so often, the function would generate a duplicate hash by chance, and occasionally this would correspond to actual duplicate data.

  • C-Octothorpe (unregistered) in reply to anon
    anon:
    well you'd obviously need a UUID

    that is, a Unique Universe IDentifier

    I think it's implemented something like this: http://rubbishsoft.com/longguid/

    I hate akismet with a white-hot passion! Piece of crap-ware!

  • Math (unregistered) in reply to mangobrain

    Once in a while? You mean once in every 9999999999999 * 9999999999999 * 9999999999999 = 1.0E39 data entries?

  • CoderHero (unregistered) in reply to Anonymous
    Anonymous:
    AA:
    Anonymous:
    Anonymous:
    And there was me thinking that a hash was supposed to somehow relate to the item being hashed. This will make it much easier to implement hashing algorithms.
    function createHash()
    {
      return globalCounter++;
    }
    
    Hey, that was easy!

    I guess, if you're the kind of person who likes to introduce concurrency bottlenecks into arbitrary nonconcurrent functions.

    Oh my God, I'm exactly that sort of person! I also like long walks on the beach and people with a good sense of humour. Oh, I guess we're not compatible after all.
    Wow, that has so much win I can barely speak!

  • Mijzelf (unregistered) in reply to Damien

    It's client side. You can never guarantee that Math.Seed() and Math.Rand() have the same implementation on all clients.

  • Design Pattern (unregistered) in reply to Math
    Math:
    Once in a while? You mean once in every 9999999999999 * 9999999999999 * 9999999999999 = 1.0E39 data entries?
    Pseudorandom number generator
    wikipedia:
    A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers that approximates the properties of random numbers. The sequence is not truly random in that it is completely determined by a relatively small set of initial values, called the PRNG's state.
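
    The "completely determined" part is easy to demonstrate with even a toy generator. Below is a textbook LCG (constants from Numerical Recipes, purely illustrative): seed it twice with the same value and you get the identical "random" sequence twice.

    // Toy linear congruential generator: same seed, same sequence, every time.
    function makeLcg(seed) {
      var state = seed >>> 0;
      return function () {
        state = (Math.imul(1664525, state) + 1013904223) >>> 0;
        return state / 4294967296;  // scale to [0, 1)
      };
    }
    var a = makeLcg(42), b = makeLcg(42);
    console.log(a(), a());  // two "random" numbers
    console.log(b(), b());  // the exact same two numbers again
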
  • Tharg (unregistered)

    So, nobody has the wit to simply put a unique constraint on the data field in question - ah, I know, the application can maintain data integrity, silly me.

  • C-Octothorpe (unregistered) in reply to Tharg
    Tharg:
    So, nobody has the wit to simply put a unique constraint on the data field in question - ah, I know, the application can maintain data integrity, silly me.

    Isn't that the "Right Way (TM)"?

  • (cs) in reply to bertram
    bertram:
    No, that isn't the joke. The article clearly says that duplication was reduced. This has yet to be explained.

    My guess: Sam's colleague automated the de-duplication script and set it to run once per day. The "hash" code was a smokescreen to let him slough off for a week or so.

  • KRG (unregistered) in reply to bertram

    Let's say that the hashes randomly match 10% of the time without regard to any other property of the entry. Basic statistics would suggest that you'd see a 10% reduction in the absolute number of duplicate entries, just as long as you didn't bother to look to see if there was any corresponding decline in valid entries compared to the total number of records processed.

    That's assuming that they even bothered to count how many duplicate records were still slipping through (which would have been a sign to anyone with a sense of what the system was doing that something wasn't right) and didn't simply count the number of rejections as the number of duplicate entries prevented, ignoring actual duplication.
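
    A quick simulation makes the point; the 20% duplicate rate and 10% rejection rate below are made-up numbers:

    // Random rejection hits duplicates and valid entries alike, so the
    // duplicate count drops about 10% while real data quietly drops 10% too.
    var total = 100000, dupRate = 0.2, rejectRate = 0.1;
    var dupsStored = 0, validStored = 0;
    for (var i = 0; i < total; i++) {
      var isDup = Math.random() < dupRate;
      if (Math.random() < rejectRate) continue;  // bogus "hash collision"
      if (isDup) { dupsStored++; } else { validStored++; }
    }
    console.log(dupsStored, validStored);  // roughly 18000 and 72000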

  • C-Octothorpe (unregistered) in reply to KRG
    KRG:
    Let's say that the hashes randomly match 10% of the time without regard to any other property of the entry. Basic statistics would suggest that you'd see a 10% reduction in the absolute number of duplicate entries, just as long as you didn't bother to look to see if there was any corresponding decline in valid entries compared to the total number of records processed.

    That's assuming that they even bothered to count how many duplicate records were still slipping through (which would have been a sign to anyone with a sense of what the system was doing that something wasn't right) and didn't simply count the number of rejections as the number of duplicate entries prevented, ignoring actual duplication.

    ...right, but you're ignoring a key piece of information here: management noticed the decline in duplicate entries...

    You have a group of people who get distracted by shiny things analyzing the statistics of their data.

  • Anonymous (unregistered) in reply to Tharg
    Tharg:
    So, nobody has the wit to simply put a unique constraint on the data field in question - ah, I know, the application can maintain data integrity, silly me.
    The data field? Oh, you idealistic DB types! The "database" is flat-file and your only means of interacting with it is via simple text read/writes (this is TDWTF after all). What do you do now?
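
    (For the pedants: you reinvent the index by hand, something like the Node.js sketch below. The file name and the one-record-per-line format are assumptions; the article never says.)

    // Flat-file dedup: hash each record and skip the ones already seen.
    var fs = require('fs');
    var crypto = require('crypto');
    var seen = {};
    var lines = fs.readFileSync('entries.txt', 'utf8').split('\n');
    var unique = lines.filter(function (line) {
      var key = crypto.createHash('sha1').update(line.trim()).digest('hex');
      if (seen[key]) { return false; }  // a real duplicate, really detected
      seen[key] = true;
      return true;
    });
    fs.writeFileSync('entries.txt', unique.join('\n'));
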
  • (cs) in reply to AA
    AA:
    Anonymous:
    Anonymous:
    And there was me thinking that a hash was supposed to somehow relate to the item being hashed. This will make it much easier to implement hashing algorithms.
    function createHash()
    {
      return globalCounter++;
    }
    
    Hey, that was easy!

    I guess, if you're the kind of person who likes to introduce concurrency bottlenecks into arbitrary nonconcurrent functions.

    Incrementing an integer can be done in an atomic fashion...

  • (cs) in reply to Sir Robin-The-Not-So-Brave
    Sir Robin-The-Not-So-Brave:
    Anonymous:
    Anonymous:
    And there was me thinking that a hash was supposed to somehow relate to the item being hashed. This will make it much easier to implement hashing algorithms.
    function createHash()
    {
      return globalCounter++;
    }
    
    Hey, that was easy!

    It's data entry by users. By hand. One wouldn't even need a really long hash. (globalCounter++).toString(16) only once would be more than enough. OTOH 10^48 random numbers is also more than enough to avoid a hash collision in most cases of manual data entry, provided that the random generator is properly seeded. It's a really stupid implementation, but it will probably work provided that you never have to regenerate the same hash from the same source. And it's fewer lines of code than a complete SHA implementation.

    So yeah, in theory it's a WTF and I would never write something like this myself, but in practice it works well enough.

    Wow, that comment was an even bigger WTF than the original post! Congratulations!

  • C-Octothorpe (unregistered) in reply to hoodaticus
    hoodaticus:
    Sir Robin-The-Not-So-Brave:
    Anonymous:
    Anonymous:
    And there was me thinking that a hash was supposed to somehow relate to the item being hashed. This will make it much easier to implement hashing algorithms.
    function createHash()
    {
      return globalCounter++;
    }
    
    Hey, that was easy!

    It's data entry by users. By hand. One wouldn't even need a really long hash. (globalCounter++).toString(16) only once would be more than enough. OTOH 10^48 random numbers is also more than enough to avoid a hash collision in most cases of manual data entry, provided that the random generator is properly seeded. It's a really stupid implementation, but it will probably work provided that you never have to regenerate the same hash from the same source. And it's fewer lines of code than a complete SHA implementation.

    So yeah, in theory it's a WTF and I would never write something like this myself, but in practice it works well enough.

    Wow, that comment was an even bigger WTF than the original post! Congratulations!

    As if you're surprised! This is TDWTF after all...

  • (cs)
    And therein laid the problem

    This is the most maddening thing about this whole article. It should be "lay."

  • Anon (unregistered) in reply to hoodaticus
    hoodaticus:
    Incrementing an integer can be done in an atomic fashion...
    Except "return globalCounter++;" isn't just reading an integer.
  • (cs)

    Regarding the "post twice" vs. "multiple entry of the same data".....

    I vote for the former. If people in different areas have different (unique) sources of data, and each only knows about their own source, then the latter is unlikely to happen.

    Does not excuse this being a poor way to handle it though...

  • Ralph (unregistered) in reply to dohpaz42
    dohpaz42:
    ...during form submission to check against a user mistakenly hitting submit more than once (which could happen if the submission was taking a while and an impatient user kept pressing submit thinking that would make things go faster...). Granted, there are better ways to guard against that sort of WTFry; simply disabling the submit button when it is pressed...
    How do you plan to disable my submit button when I don't choose to give you control over my browser? And oh by the way you do realize that you are attempting client-side input control, which I can easily defeat, which means you have to implement server-side input control too, and at that point, why bother duplicating the effort on the client where it is only effective some of the time? Because your employer, perhaps, has money to waste?? People who suffer from such sloppy thinking make me long for a device that will reach out of your monitor and slap your face to wake you up.

    I really do wish all you losers who use client side scripts to validate data would just dry up and blow away.

  • Jeff (unregistered) in reply to Just Me
    Just Me:
    A hash of the text will provide a quick way to be sure two texts are different.
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?
  • (cs) in reply to hoodaticus
    hoodaticus:
    Incrementing an integer can be done in an atomic fashion...
    Yes, but "++" is not generally one of those fashions.
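
    Where shared state is genuinely involved, you want an explicit atomic fetch-and-add instead. A sketch with SharedArrayBuffer and Atomics, purely illustrative (in single-threaded browser JavaScript the whole question is moot):

    // Atomics.add returns the old value and increments in one
    // indivisible step, unlike "counter++" on shared state.
    var counter = new Int32Array(new SharedArrayBuffer(4));
    function createHash() {
      return Atomics.add(counter, 0, 1);  // safe across worker threads
    }
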
  • C-Octothorpe (unregistered) in reply to Ralph
    Ralph:
    dohpaz42:
    ...during form submission to check against a user mistakenly hitting submit more than once (which could happen if the submission was taking a while and an impatient user kept pressing submit thinking that would make things go faster...). Granted, there are better ways to guard against that sort of WTFry; simply disabling the submit button when it is pressed...
    How do you plan to disable my submit button when I don't choose to give you control over my browser? And oh by the way you do realize that you are attempting client-side input control, which I can easily defeat, which means you have to implement server-side input control too, and at that point, why bother duplicating the effort on the client where it is only effective some of the time? Because your employer, perhaps, has money to waste?? People who suffer from such sloppy thinking make me long for a device that will reach out of your monitor and slap your face to wake you up.

    I really do wish all you losers who use client side scripts to validate data would just dry up and blow away.

    Did anybody else hear the whooshing sound? It gets louder every time I hear it...

    Huh, weird. The way things are going today, I'm sure we'll hear it again, and again...
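
    The point was convenience, not security. The client-side half is a few lines (sketched below, assuming a plain HTML form), and the server still validates everything regardless:

    // Convenience only: stop impatient double-clicks at the source.
    // The server must still reject duplicate submissions on its own.
    document.querySelector('form').addEventListener('submit', function (e) {
      var btn = e.target.querySelector('input[type=submit], button[type=submit]');
      if (btn) { btn.disabled = true; }
    });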

  • Mike (unregistered)

    There's no WTF here. OK, except maybe that the OP and his colleague were working with unclear requirements. And that the unique id generator is called a hash. And languages that define modulo arithmetic on non-integers. And that the modulo operator was a no-op due to the input data. And the unnecessary use of integers more than 32 bits wide. But yeah, other than that stuff, not a WTF.

  • Vaca Loca (unregistered) in reply to Ralph
    Ralph:
    dohpaz42:
    ...during form submission to check against a user mistakenly hitting submit more than once (which could happen if the submission was taking a while and an impatient user kept pressing submit thinking that would make things go faster...). Granted, there are better ways to guard against that sort of WTFry; simply disabling the submit button when it is pressed...
    How do you plan to disable my submit button when I don't choose to give you control over my browser? And oh by the way you do realize that you are attempting client-side input control, which I can easily defeat, which means you have to implement server-side input control too, and at that point, why bother duplicating the effort on the client where it is only effective some of the time? Because your employer, perhaps, has money to waste?? People who suffer from such sloppy thinking make me long for a device that will reach out of your monitor and slap your face to wake you up.

    I really do wish all you losers who use client side scripts to validate data would just dry up and blow away.

    Validating client side AND server side saves electrons. Not everyone has high-speed internet access, and not everyone is happy with an 800 KB page refresh on each incorrect form submission.

  • C-Octothorpe (unregistered) in reply to Jeff
    Jeff:
    Just Me:
    A hash of the text will provide a quick way to be sure two texts are different.
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?

    I think the "idea" was to create a hash of all the field values and pass around a single hash value rather than comparing each field each time (or passing around possibly 30, 50, etc. values). In fact I would store the hash of the record in an indexed column for easy comparison (assuming they have a DB)... Of course this would require a trigger to recompute the hash whenever the data changes, etc. Conceptually, it's a good idea, but the implementation was an epic fail.

    I once read that the only thing worse than inaccurate data is inaccurate data that you think is right...
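
    The sane version of that idea is tiny. A sketch with Node's crypto; the field list and the NUL separator are assumptions:

    // Content-derived record hash: same fields in, same hash out,
    // which is exactly the property the article's version was missing.
    var crypto = require('crypto');
    function recordHash(fields) {
      // Joining on "\u0000" assumes the data never contains NUL bytes.
      return crypto.createHash('sha1')
                   .update(fields.join('\u0000'), 'utf8')
                   .digest('hex');
    }
    // e.g. store recordHash([name, address, phone]) in an indexed column
    // and recompute it whenever the row changes.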

  • (cs) in reply to ShatteredArm
    ShatteredArm:
    And therein laid the problem

    This is the most maddening thing about this whole article. It should be "lay."

    It must annoy you to no end whenever anyone "looses" a personal item.

  • Mike (unregistered) in reply to Jeff
    Jeff:
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?
    It's not about comparing two strings. It's about comparing one string against many strings. A hash table or an indexed db hash field can be used to accomplish this much more quickly than brute force.
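
    Concretely, something like this, with a toy string hash standing in for a real one:

    // One-against-many: each lookup is O(1) against a hash-keyed table,
    // instead of comparing the new string against every stored string.
    function djb2(str) {  // tiny illustrative hash; use a real one in practice
      var h = 5381;
      for (var i = 0; i < str.length; i++) {
        h = ((h * 33) ^ str.charCodeAt(i)) >>> 0;
      }
      return h;
    }
    var seen = {};  // hash -> array of originals (collisions happen)
    function isDuplicate(text) {
      var key = djb2(text);
      var bucket = seen[key] || (seen[key] = []);
      if (bucket.indexOf(text) !== -1) { return true; }
      bucket.push(text);
      return false;
    }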

    (Feeling a bit trolled somehow...)

  • moi (unregistered) in reply to mangobrain

    You sure it would randomly happen? because Math.random() % 1 isn't a random value - it will always be 0; so the hashed value will always be "000000000000000000000000000000000000000000000000"

  • moi (unregistered) in reply to mangobrain
    mangobrain:
    TheSHEEEP:
    Ehrm...

    I feel kinda stupid. Can anyone tell me how that "hash" helped reduce duplicate entries? Because I really don't get how it could do that.

    It didn't, but once in a while, a new submission would (at random) be assigned the same "hash" as an existing submission. That, and a healthy dose of placebo effect.

    You sure it would randomly happen? because Math.random() % 1 isn't a random value - it will always be 0; so the hashed value will always be "000000000000000000000000000000000000000000000000"

  • John Hardin (unregistered)

    Oh, god, the pain. It is blinding.

    I have got to stop coming by TDWTF.

  • aptent (unregistered) in reply to mangobrain

    Once in 1000000000000000000000000000000000000000000000000. Yeah, I really don't think you'll ever have the same hash twice.

  • (cs)

    Yes, this code is a massive WTF. But the biggest WTF is their business process. They should be splitting up the work when it's assigned, to keep any two people from ever performing duplicate work in the first place.

  • coyo (unregistered) in reply to Visage
    Visage:
    'he just implemented a quick SHA-1'

    Thats the real WTF, right there.

    Yup! This is no guarantee against collisions.

  • C-Octothorpe (unregistered) in reply to cod3_complete
    cod3_complete:
    Yes, this code is a massive WTF. But the biggest WTF is their business process. They should be splitting up the work when it's assigned, to keep any two people from ever performing duplicate work in the first place.

    No, actually this code is a larger WTF, because duplicate data can be handled, but coordinating the efforts of users across the world could be a physical impossibility...

    Management trusted the developer to build them something to prevent this problem; instead they got something that actually makes them lose data and only randomly may or may not prevent duplicate data from being entered.

    I'd rather have duplicate data than no/missing data.

  • (cs) in reply to PedanticCurmudgeon
    PedanticCurmudgeon:
    ShatteredArm:
    And therein laid the problem

    This is the most maddening thing about this whole article. It should be "lay."

    It must annoy you to no end whenever anyone "looses" a personal item.
    This may well be, but I am more than happy that someone laid this issue to rest.

  • (cs) in reply to Anon
    Anon:
    hoodaticus:
    Incrementing an integer can be done in an atomic fashion...
    Except "return globalCounter++;" isn't just reading an integer.
    Good point!
  • airdrik (unregistered) in reply to Mike
    Mike:
    Jeff:
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?
    It's not about comparing two strings. It's about comparing one string against many strings. A hash table or an indexed db hash field can be used to accomplish this much more quickly than brute force.

    (Feeling a bit trolled somehow...)

    You replied to a comment on TDWTF, therefore you've been trolled.
    I replied to a comment on TDWTF, therefore I've been trolled. Anybody who disagrees...

  • ÃÆâ€℠(unregistered)

    Nagesh and his outsourcing office strike again!

  • Abso (unregistered) in reply to airdrik
    airdrik:
    Mike:
    Jeff:
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?
    It's not about comparing two strings. It's about comparing one string against many strings. A hash table or an indexed db hash field can be used to accomplish this much more quickly than brute force.

    (Feeling a bit trolled somehow...)

    You replied to a comment on TDWTF, therefore you've been trolled.
    I replied to a comment on TDWTF, therefore I've been trolled. Anybody who disagrees...

    Hey, that's not tr—

    Oh. Oops.

  • C-Octothorpe (unregistered) in reply to ÃÆâ€â„Â
    ÃÆâ€â„Â:
    Nagesh and his colleagues from his outsourcing mud hut strike again!

    FTFY

  • Gunslnger (unregistered)

    Sam wanted to use the hashing logic for a similar problem.

    TRWTF is code reuse in an enterprise situation, amirite?

  • ÃÆâ€℠(unregistered) in reply to John Hardin
    John Hardin:
    Oh, god, the pain. It is blinding.

    I have got to stop coming by TDWTF.

    You're in luck! You're no longer coming to TDWTF, you're coming to the TOEFDWTF.

    (That's the once every few days WTF)

  • Sudo (unregistered) in reply to aptent
    aptent:
    Once in 1000000000000000000000000000000000000000000000000. Yeah, I really don't think you'll ever have the same hash twice.
    >>> from random import randrange
    >>> randrange(1000000000000000000000000000000000000000000000000)
    479866125362889704564601059308345722216664906502L
    >>> randrange(1000000000000000000000000000000000000000000000000)
    479866125362889704564601059308345722216664906502L
    >>> 
    Well, I never... What are the odds?!
  • blarg (unregistered) in reply to Jeff
    Jeff:
    Just Me:
    A hash of the text will provide a quick way to be sure two texts are different.
    How is it faster to compute two hashes and compare them vs. just comparing the two strings?

    You only need to compute the hash for each string once (expensive); after that, comparing hashes is very cheap.

    Comparing the strings unhashed would be moderately expensive for every comparison.

    If the data were guaranteed to be short, then perhaps you would not benefit much from the hash.

  • blarg (unregistered) in reply to psuedonymous
    psuedonymous:
    Does anyone else hear that odd whooshing sound?

    Is that how it sounded to you?
