• (cs)

    Michael Bolton?

  • Tim (unregistered)

    Sure, the random filename function could usefully have generated longer names to reduce the possibility of a collision, but using a hash function instead would be guaranteed to make a collision if the original filenames were the same. Presumably that's not what was intended.

  • Simon (unregistered) in reply to Tim

    Add a salt of the username/date&time/extrarandomcharacters

  • (cs) in reply to Tim
    Tim:
    but using a hash function instead would be guaranteed to make a collision if the original filenames were the same.
    Agreed; especially if people uploaded pictures taken with their cameras, which all tend to follow the same naming scheme, or similar automatically generated filenames. A somewhat better approach is to hash over the file itself, although the effectiveness of that hinges on the hashing algorithm used.
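
    To make the difference concrete, here is a minimal sketch comparing a key derived from the file name with a key derived from the file contents. SHA-256, the helper names `name_key` and `content_key`, and the sample photo data are all assumptions for illustration; none of them come from the original article.

```python
import hashlib


def name_key(filename: str) -> str:
    """Storage key derived from the file name only."""
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()


def content_key(data: bytes) -> str:
    """Storage key derived from the file's bytes."""
    return hashlib.sha256(data).hexdigest()


# Two different photos that a camera happened to give the same name
# (hypothetical sample data).
photo_a = ("IMG_0001.JPG", b"first photo bytes")
photo_b = ("IMG_0001.JPG", b"second photo bytes")

same_name_key = name_key(photo_a[0]) == name_key(photo_b[0])
same_content_key = content_key(photo_a[1]) == content_key(photo_b[1])

# Name-based keys clash for identical names; content-based keys do not
# clash here, because the bytes differ.
print("name keys equal:   ", same_name_key)
print("content keys equal:", same_content_key)
```

    With name-based keys, the second upload would silently land on the first one's slot; with content-based keys, only byte-identical files share a slot.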
  • mz001 (unregistered)

    His solution was to make it more likely for collisions to occur? A genius like that should feel right at home at Oracle, Adobe or Microsoft!

  • some pony (unregistered) in reply to FragFrog

    A SHA-512 hash of the data in the file gives you a 99.9% chance you won't have duplicates for up to 5.2e75 (different) files.

    Sure, if you have the same file multiple times, it will collide with itself. But does it really matter whether a hash refers to the first or second copy of exactly the same data?
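
    The 99.9% figure can be sanity-checked with the usual birthday-bound approximation, p ≈ 1 - exp(-n² / 2N). This is a rough sketch that assumes SHA-512 output behaves like a uniformly random 512-bit value; the function name is made up for the example.

```python
import math


def collision_probability(n_items: float, n_buckets: float) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2N))."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-(n_items ** 2) / (2.0 * n_buckets))


# 5.2e75 distinct files hashed into a 512-bit space.
p = collision_probability(5.2e75, 2.0 ** 512)
print(f"approximate collision probability: {p:.4%}")
```

    Under that assumption the collision probability comes out at roughly one in a thousand, which matches the quoted 99.9% chance of no duplicates.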

  • RFoxmich (unregistered)

    Sometimes that's what you want...for the same file uploaded by the same user to make a hash collision and overwrite or prompt if that's intended. In such cases it's good not to use the date/time in the salt but only the username.
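
    A minimal sketch of that username-only salt, assuming SHA-256 and a hypothetical helper called `salted_key` (the article does not show how the real system would have done it):

```python
import hashlib


def salted_key(username: str, filename: str) -> str:
    """Key that is stable per (user, file name) pair."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from producing
    # the same input bytes.
    material = username.encode("utf-8") + b"\x00" + filename.encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# Same user, same name: same key, so a re-upload can overwrite or prompt.
first_upload = salted_key("alice", "track01.mp3")
second_upload = salted_key("alice", "track01.mp3")

# Different user, same name: different key, so no cross-user clash.
other_user = salted_key("bob", "track01.mp3")

print(first_upload == second_upload, first_upload == other_user)
```

    Adding the date and time to the salt would make every upload unique, which is exactly what you do not want if re-uploads are supposed to hit the same slot.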

    Problem is, however that as good as MD5 is you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating collisions to an existing MD5 hash value e.g.). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.

  • Leo (unregistered)

    And then Daenerys said, "Draughon, dracarys" and Draughon breathed fire and burnt the building down.

  • Tea (unregistered)

    UUID anyone?

  • AC (unregistered)

    The original hash only gave 33.5 million possible hashes, so "tens of millions of files" not causing a collision seems very very unlikely...

  • CodeCow (unregistered)

    Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

    Considering he's not a pilot or something like that, that seems pretty fucked up

  • Cam (unregistered)

    Here be Draughons.

  • Frank (unregistered)

    TRWTF is that both are idiots.

  • Xyon (unregistered)

    His valid chars also didn't include the letter "H". I guess he doesn't trust that either.

  • Davor (unregistered)

    I wonder why he skipped h, l, u and v?

  • (cs) in reply to Davor
    Davor:
    I wonder why he skipped h, l, u and v?

    hluv I know...

  • Kactus (unregistered)

    We had a simple public facing but in house developed system for tftp configs that would take requests and generate files from a db. Because of the dangers of putting anything from the internet into db queries it was a requirement we hashed every request and did a lookup on the hash, and because of the danger of collisions we hashed to both md5 and sha and did the lookup on both hashes.

    TRWTF? The system at its peak handled fewer than 50 devices total, and they simply sent their MAC address as the tftp param.

    (also when I said we I meant I >.>)

  • Doctor_of_ineptitude (unregistered) in reply to mz001
    mz001:
    His solution was to make it more likely for collisions to occur? A genius like that should feel right at home at Oracle, Adobe or Microsoft!

    How do you know that Oracle, Adobe or Microsoft haven't already hired such a genius? Maybe all three of them at the same time.

  • (cs)

    Better titles for this article:

    • Pass the Hash
    • Hash Table Collision
    • Smokin' Hash
    • 420 Error

    'Re-Inventing the Alphabet'? ... c'mon guys!

  • ZoomST (unregistered)

    At first I called it fake, seeing the crapper get fired -- not likely to happen in real life. But then I realized that he was storing data files with random names (which, I should point out, is not hashing in any sense): suspicious files for some anti-virus systems... So I think he was fired for inserting a virus into the server. Now it makes sense to me.

  • ZoomST (unregistered) in reply to Tankster
    Tankster:
    Better titles for this article:
    • Pass the Hash
    • Hash Table Collision
    • Smokin' Hash
    • 420 Error

    'Re-Inventing the Alphabet'? ... c'mon guys!

    As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

  • (cs) in reply to ZoomST
    ZoomST:
    Tankster:
    Better titles for this article:
    • Pass the Hash
    • Hash Table Collision
    • Smokin' Hash
    • 420 Error

    'Re-Inventing the Alphabet'? ... c'mon guys!

    As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

    • Joint WTF?
  • Jasper (unregistered)

    Let's see.

    abcdefgijkmnopqrstwxyz1234567890 - that's 32 characters.

    The filenames consist of 5 characters randomly chosen out of those 32. That means there are 32^5 = 33,554,432 possible filenames.

    So, if there were tens of millions of files, the chance would be quite high (or even certain) that there must have been collisions.

    Maybe, when there was a collision, the software would just overwrite an existing file with the same name?
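
    The count is easy to check. The alphabet string below is copied from the comment above; the variable names are just for the example.

```python
import string

ALPHABET = "abcdefgijkmnopqrstwxyz1234567890"  # as quoted in the comment
NAME_LENGTH = 5

# Which lowercase letters were left out of the alphabet?
missing = sorted(set(string.ascii_lowercase) - set(ALPHABET))

# Number of distinct names of NAME_LENGTH characters.
total_names = len(ALPHABET) ** NAME_LENGTH

print(len(ALPHABET), missing, f"{total_names:,}")
```

    Since the number of files can exceed the number of possible names only by reusing names, anything beyond 33,554,432 uploads is guaranteed to collide, and the birthday effect makes collisions likely long before that.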

  • Jasper (unregistered) in reply to ZoomST
    ZoomST:
    As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.
    That was probably intentional, to make it a nice round number of characters (32).
  • Davor (unregistered) in reply to Jasper
    Jasper:
    [..]

    Maybe, when there was a collision, the software would just overwrite an existing file with the same name?

    I'm guessing this is what happened.

    With so many files, I could imagine any clients using the service would just "store and forget". Kind of like an archive in a bureaucratic environment.

    "Could you get file x from 1963 from the archives, Greg?" < "Sure, be right back" (staff giggles) < 2 years later, Greg suddenly reappears "Greg, we thought you were fired or something, where did you go?" < "To get file X from the archives, it wasn't there anymore, instead I got file Y from 1974"

  • Math (unregistered) in reply to some pony

    "99.9%"

    So, once out of every 1000 it will fail?

  • (cs) in reply to CodeCow
    CodeCow:
    Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?
    He should have trusted hash!
  • Anony (unregistered) in reply to Math

    No, if you generate 5.2e75 different files 1000 times, one will cause a collision.

  • ¯\(°_o)/¯ I DUNNO LOL (unregistered)

    Do not MD5 in the files of Draughons, for you are crunchy, and good with ketchup.

    Woah man, I got the munchies sooooo bad...

  • (cs) in reply to skotl
    skotl:
    Davor:
    I wonder why he skipped h, l, u and v?

    hluv I know...

    If this doesn't become a featured comment, we'll know Alex is on vacation.

  • ¯\(°_o)/¯ I DUNNO LOL (unregistered) in reply to RFoxmich
    RFoxmich:
    Problem is, however that as good as MD5 is you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating collisions to an existing MD5 hash value e.g.). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.
    Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.

    An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy. So that's about 50-100 bits of entropy, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.
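
    The arithmetic behind that estimate, as a small sketch. The 5 bits per character and the 4-byte extension are the commenter's figures; the function name is made up for the example.

```python
EXTENSION_BYTES = 4   # ".mp3" contributes no entropy
BITS_PER_CHAR = 5     # the commenter's rough per-character estimate
MD5_BITS = 128


def name_entropy_bits(name_length: int) -> int:
    """Rough entropy of a file name of the given length, in bits."""
    return (name_length - EXTENSION_BYTES) * BITS_PER_CHAR


for length in (15, 25):
    bits = name_entropy_bits(length)
    print(f"{length}-byte name: about {bits} bits (MD5 output: {MD5_BITS} bits)")
```

    Both ends of the range stay below the 128 bits MD5 produces, which is the basis for saying that distinct names are unlikely to collide by accident.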

  • Mike (unregistered) in reply to ¯\(°_o)/¯ I DUNNO LOL
    ¯\(°_o)/¯ I DUNNO LOL:
    RFoxmich:
    Problem is, however that as good as MD5 is you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating collisions to an existing MD5 hash value e.g.). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.
    Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.

    An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy. So that's about 50-100 bits of entropy, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.

    You can use whatever hash algorithm you want, if you hash the same file name you get the same hash

  • SHA (unregistered) in reply to Tankster
    Tankster:
    ZoomST:
    Tankster:
    Better titles for this article:
    • Pass the Hash
    • Hash Table Collision
    • Smokin' Hash
    • 420 Error

    'Re-Inventing the Alphabet'? ... c'mon guys!

    As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

    • Joint WTF?

    Come on, "Rolling your own Hash" would still have been good.

  • OldCoder (unregistered) in reply to Mike
    Mike:
    ¯\(°_o)/¯ I DUNNO LOL:
    RFoxmich:
    Problem is, however that as good as MD5 is you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating collisions to an existing MD5 hash value e.g.). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.
    Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.

    An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy. So that's about 50-100 bits of entropy, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.

    You can use whatever hash algorithm you want, if you hash the same file name you get the same hash

    Well, duh. If you want to overwrite a file, you have to end up with the same hash.

  • (cs)

    It seems that he wanted to overwrite files with the same name (hmmm, what if they belong to different users?) So why not just use the original file name?

    Also, it's not clear how this was supposed to work. A client uploads a file, then browses all existing file names? Or are they stored in a DB? How did this work originally?

  • Gunnar Tveiten (unregistered) in reply to AC

    Beyond unlikely. The birthday paradox says that though there are on the order of 2^26 possible filenames, you'd expect, on average, to get the first collision at around file 2^13, or after having uploaded about 10k files.

    After that, collisions get more and more likely. To get to "tens of millions of files" with zero collisions is astronomically unlikely.
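
    A rough sketch of that birthday calculation for the 32^5 name space, assuming names are drawn uniformly at random. It uses the standard approximations exp(-n(n-1)/2N) for the chance of no collision and sqrt(πN/2) for the expected number of draws before the first one.

```python
import math

N = 32 ** 5  # possible 5-character names from a 32-character alphabet


def p_no_collision(n_files: int, n_names: int = N) -> float:
    """Approximate probability that n_files random names are all distinct."""
    return math.exp(-n_files * (n_files - 1) / (2.0 * n_names))


# Expected number of uploads before the first collision (approximation).
expected_first_collision = math.sqrt(math.pi * N / 2.0)
print(f"first collision expected after about {expected_first_collision:,.0f} files")

for n in (1_000, 10_000, 100_000, 10_000_000):
    print(f"{n:>10,} files: P(no collision) ~= {p_no_collision(n):.3g}")
```

    The expected first collision lands a bit over 7,000 files, the same ballpark as the 2^13 estimate, and by ten million files the probability of no collision underflows to zero in double precision.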

  • Vlad Patryshev (unregistered)

    Oh. I recognize the solution. The architect in my previous company did that. Guess who was escorted, though.

  • (cs)

    With 60.4 million possible filenames (36^5) generated by this function, and tens of millions of uploads, there's almost certainly a name collision.

    Michael just hasn't found them yet.

  • (cs)

    l is missing because it looks like I and 1; u is missing so that none of the files end up with "fuck" in their names; h and v are missing because... errr...

    OK, I got nothing.

    As for using a hash over the file contents instead: I believe Dropbox does this very thing, which is why if you upload a movie that somebody else has already stolen and uploaded before you, your own upload will complete impossibly quickly. It's a reasonably effective de-duplication scheme that has to be saving Dropbox a shitload of disk space, and is probably the main reason why they don't automatically encrypt your stuff client-side.
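
    For illustration only, here is a toy content-addressed store showing how hashing file contents gives de-duplication for free. This is a generic sketch with made-up names (`ContentStore`, `put`, `get`) and SHA-256 as an assumed hash; it is not a description of how Dropbox actually works.

```python
import hashlib


class ContentStore:
    """Toy in-memory store keyed by the SHA-256 of each blob."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # If the key already exists, the bytes are (almost certainly)
        # identical, so nothing new needs to be stored.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
key_first = store.put(b"the same movie bytes")
key_second = store.put(b"the same movie bytes")  # second uploader, same file
key_other = store.put(b"a different file")

print(key_first == key_second, len(store))
```

    A real service could check whether the key exists before transferring any data, which would explain an upload that finishes almost instantly.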

    Addendum (2013-05-01 11:31): On further thought: h is missing so none of the files end up with "shit" in their names, and v is missing because it looks too much like u.

  • Captcha: acsi (unregistered)
    random drug screenings
    TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.
  • (cs) in reply to Captcha: acsi
    Captcha: acsi:
    random drug screenings
    TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.

    What are you talking about, you pinko commie asshole? There is no country in the world whose corporations have more freedom to oppress, exploit and enslave their workers than the USA.

    USA! USA! USA!

  • Mike (unregistered) in reply to flabdablet

    Dear flabdablet, it took you 10 minutes to write that comment?

  • (cs) in reply to Captcha: acsi
    Captcha: acsi:
    random drug screenings
    TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.

    In the USA, we see this as the freedom of the business owner to demand a drug test any time he wants.

  • Ken B (unregistered) in reply to Davor
    Davor:
    I wonder why he skipped h, l, u and v?
    Because, including '0' through '9', that would make 36 characters. By eliminating four, he now has a nice round 32 characters.
  • Ken B (unregistered) in reply to ZoomST
    ZoomST:
    At first I called it fake, seeing the crapper get fired -- not likely to happen in real life. But then I realized that he was storing data files with random names (which, I should point out, is not hashing in any sense): suspicious files for some anti-virus systems... So I think he was fired for inserting a virus into the server. Now it makes sense to me.
    Sorry, not possible, as 'v' and 'u' are never part of the filename. :-)
  • Ken B (unregistered) in reply to SHA
    SHA:
    Tankster:
    ZoomST:
    Tankster:
    Better titles for this article:
    • Pass the Hash
    • Hash Table Collision
    • Smokin' Hash
    • 420 Error

    'Re-Inventing the Alphabet'? ... c'mon guys!

    As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

    • Joint WTF?
    Come on, "Rolling your own Hash" would still have been good.
    You can't roll your own hash without the letter 'h'.
  • Hannes (unregistered) in reply to CodeCow
    CodeCow:
    Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

    I worked through some code that one of my pre-pre-pre-decessors at my current workplace wrote... boy, that's pure TDWTF material there. I will submit some of it in the future (one method has 45! nested if-else-statements...). But he wasn't fired for being "a shitty coder". In fact, his programs do work. They are slow as hell and can't be maintained (because nobody wants to dig through the 1000 lines of uncommented code that guy wrote for a task as simple as reading a CSV file into a DataGrid). But since the programs do what they should do, nobody who could fire him knew that he was a shitty coder.

    TL;DR: For Draughon to be fired for being "a shitty coder" you'd need someone in charge to KNOW that he's a "shitty coder". Very unlikely.

  • Evan (unregistered) in reply to some pony
    some pony:
    Sure, if you have the same file multiple times, it will collide with itself. But does it really matter whether a hash refers to the first or second copy of exactly the same data?
    Except that's not what would happen.

    The hash is over the file name, not contents. It's totally possible that you could have two files (e.g. track01.mp3) with the same name and totally different contents, and those would collide.

  • AllGeneralizationsAreBad (unregistered) in reply to CodeCow
    CodeCow:
    Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

    Considering he's not a pilot or something like that, that seems pretty fucked up

    It mentions he had rolling papers in his desk. Circumstantial evidence, yes, but come on, why would you have the paraphernalia at work? So you could roll one when you get to the car where you have the actual weed? That guy's a disaster waiting to happen. Maybe he signed a drug-free workplace agreement, or more likely this was in a right-to-work state.

    Captcha: esse esse marlboro officer, I swear man. I love you man.

  • (cs) in reply to Kactus
    Kactus:
    Because of the dangers of putting anything from the internet into db queries it was a requirement we hashed every request and did a lookup on the hash, and because of the danger of collisions we hashed to both md5 and sha and did the lookup on both hashes.
    Wait, what? What could possibly have been the thought process behind that requirement? They do realize that pretty much every database-powered website in existence accepts content from the internet without hashing, right?

    While I will not deny that some (okay, many) suffer from SQL injection vulnerabilities, when it concerns just a single field, I would think that even the person who wrote that requirement should have been able to find one of the dozen ways to safely insert internet data into a database.
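
    One of those ways is a parameterized query, where the driver passes the value separately from the SQL text. A minimal sketch using Python's built-in sqlite3 module; the table, column names and sample MAC address are invented for the example and have nothing to do with Kactus's actual system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE configs (mac TEXT PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO configs (mac, body) VALUES (?, ?)",
    ("00:11:22:33:44:55", "hostname=dev1"),
)


def lookup_config(db: sqlite3.Connection, mac: str):
    """Look up a config by MAC address using a bound parameter."""
    # The "?" placeholder means the value is never spliced into the SQL
    # string, so quotes in the input cannot change the query.
    row = db.execute("SELECT body FROM configs WHERE mac = ?", (mac,)).fetchone()
    return row[0] if row else None


found = lookup_config(conn, "00:11:22:33:44:55")
injected = lookup_config(conn, "x' OR '1'='1")
print(found, injected)
```

    The injection attempt is simply treated as an unknown MAC address, with no hashing of the request required.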

    Hannes:
    TL;DR: For Draughon to be fired for being "a shitty coder" you'd need someone in charge to KNOW that he's a "shitty coder". Very unlikely.
    Wait, you never talk to your managers? They never ask why maintenance of some modules takes three times as long as work done by the other devs? I mean, it might not go all the way up the chain, but presumably someone was in charge of the development team; would they not find out eventually?

    Heck, I have a few components that through years of change requests have become somewhat less clear than they could be, and I make sure my boss knows this when he asks for a quote on yet another change: until I get time allotted to clean it up, changes to these modules will take longer than similar changes in the rest of our code-base.
