The Daily WTF: Curious Perversions in Information Technology

Dogsworth · 2013-05-01 Reply Admin

Michael Bolton?

2013-05-01 Reply Admin

Sure, the random filename function could usefully have generated longer names to reduce the possibility of a collision, but using a hash function instead would be guaranteed to make a collision if the original filenames were the same. Presumably that's not what was intended.

2013-05-01 Reply Admin

Add a salt of the username/date&time/extrarandomcharacters
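A minimal sketch of that suggestion, assuming SHA-256 as the hash and a hypothetical helper named `salted_name` (neither comes from the article): the salt mixes the username, the upload time and some extra random characters into the input, so two uploads of the same filename no longer hash to the same stored name.

```python
import hashlib
import secrets
from datetime import datetime, timezone


def salted_name(original_name: str, username: str) -> str:
    """Hash the original filename together with a per-upload salt."""
    # Salt parts: username, upload time, extra random characters.
    timestamp = datetime.now(timezone.utc).isoformat()
    nonce = secrets.token_hex(8)  # 16 random hex characters
    # Join with a NUL byte so adjacent fields cannot run into each other.
    salted = "\x00".join([username, timestamp, nonce, original_name])
    return hashlib.sha256(salted.encode("utf-8")).hexdigest()


# The same user uploading the same filename twice gets two different stored names.
first = salted_name("track01.mp3", "alice")
second = salted_name("track01.mp3", "alice")
print(first != second)
```

The price is that the stored name can no longer be recomputed from the filename alone, so the mapping from original name to stored name has to be kept somewhere.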

FragFrog · 2013-05-01 Reply Admin

Tim:
but using a hash function instead would be guaranteed to make a collision if the original filenames were the same.

Agreed; especially if people uploaded pictures taken with their cameras, which all tend to follow the same naming scheme, or similar automatically generated filenames. A somewhat better approach is to hash over the file itself, although the effectiveness of that hinges on the hashing algorithm used.
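As a rough illustration of hashing over the file itself, here is a sketch that assumes SHA-256 and a hypothetical `content_digest` helper. Two files with the same camera-style name but different bytes end up with different digests.

```python
import hashlib
import os
import tempfile


def content_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


# Two cameras both produce "IMG_0001.JPG", but the bytes differ.
digests = []
with tempfile.TemporaryDirectory() as tmp:
    for folder, data in (("camera_a", b"first photo"), ("camera_b", b"second photo")):
        os.makedirs(os.path.join(tmp, folder))
        path = os.path.join(tmp, folder, "IMG_0001.JPG")
        with open(path, "wb") as handle:
            handle.write(data)
        digests.append(content_digest(path))

print(digests[0] != digests[1])
```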

2013-05-01 Reply Admin

His solution was to make it more likely for collisions to occur? A genius like that should feel right at home at Oracle, Adobe or Microsoft!

2013-05-01 Reply Admin

An SHA-512 hash of the data in the file gives you a 99.9% chance you won't have duplicates for up to 5.2e75 (different) files.

Sure, if you have the same file multiple times, it will collide with itself. But does it really matter whether a hash refers to the first or second copy of exactly the same data?
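The 5.2e75 figure can be checked with the usual birthday approximation, assuming the hash behaves like a uniformly random 512-bit function. The helper name `max_items` is made up for this sketch.

```python
import math


def max_items(hash_bits: int, collision_probability: float) -> float:
    """Birthday approximation: largest n whose collision probability stays at or below the target.

    Uses P(collision) ~= 1 - exp(-n^2 / (2N)) with N = 2^hash_bits.
    """
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * -math.log1p(-collision_probability))


# 512-bit hash, 0.1% allowed chance of any collision.
n = max_items(512, 0.001)
print(f"{n:.2e}")
```

The result comes out at roughly 5.2e75 files, in line with the number quoted above.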

2013-05-01 Reply Admin

Sometimes that's what you want...for the same file uploaded by the same user to make a hash collision and overwrite or prompt if that's intended. In such cases it's good not to use the date/time in the salt but only the username.

The problem, however, is that as good as MD5 is, you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating MD5 collisions, for example). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.

2013-05-01 Reply Admin

And then Daenerys said, "Draughon, dracarys" and Draughon breathed fire and burnt the building down.

2013-05-01 Reply Admin

UUID anyone?
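For what it's worth, a sketch of the UUID route using Python's standard `uuid` module. The `stored_name` helper is hypothetical, and keeping the original extension is an assumption for illustration. A version 4 UUID carries 122 random bits, against the 25 bits of a five-character name drawn from 32 characters.

```python
import os
import uuid


def stored_name(original_name: str) -> str:
    """Build a storage name from a random UUID, keeping the original extension."""
    _, extension = os.path.splitext(original_name)
    return f"{uuid.uuid4().hex}{extension.lower()}"


print(stored_name("track01.mp3"))
```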

2013-05-01 Reply Admin

The original hash only gave 33.5 million possible hashes, so "tens of millions of files" not causing a collision seems very very unlikely...

2013-05-01 Reply Admin

Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

Considering he's not a pilot or something like that, that seems pretty fucked up

2013-05-01 Reply Admin

Here be Draughons.

2013-05-01 Reply Admin

TRWTF is, that both are idiots.

2013-05-01 Reply Admin

His valid chars also didn't include the letter "H". I guess he doesn't trust that either.

2013-05-01 Reply Admin

I wonder why he skipped h, l, u and v?

skotl · 2013-05-01 Reply Admin

Davor:
I wonder why he skipped h, l, u and v?

hluv I know...

2013-05-01 Reply Admin

We had a simple public facing but in house developed system for tftp configs that would take requests and generate files from a db. Because of the dangers of putting anything from the internet into db queries it was a requirement we hashed every request and did a lookup on the hash, and because of the danger of collisions we hashed to both md5 and sha and did the lookup on both hashes.

TRWTF? The system at its peak handled fewer than 50 devices total, and they simply sent their mac address as the tftp param.

(also when I said we I meant I >.>)

2013-05-01 Reply Admin

mz001:
His solution was to make it more likely for collisions to occur? A genius like that should feel right at home at Oracle, Adobe or Microsoft!

How do you know that Oracle, Adobe or Microsoft haven't already hired such a genius? Maybe all three of them at the same time.

Tankster · 2013-05-01 Reply Admin

Better titles for this article:

Pass the Hash
Hash Table Collision
Smokin' Hash
420 Error

'Re-Inventing the Alphabet'? ... c'mon guys!

2013-05-01 Reply Admin

At first I called it fake, seeing the crapper being fired -- not likely to happen in real life. But then I realized that he was storing data files with random names (which, I should point out, is not hashing in any sense); suspicious files for some anti-virus systems... So I think he was fired for inserting a virus into the server. Now it makes sense to me.

2013-05-01 Reply Admin

Tankster:
Better titles for this article:

Pass the Hash

Hash Table Collision

Smokin' Hash

420 Error

'Re-Inventing the Alphabet'? ... c'mon guys!

As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

Tankster · 2013-05-01 Reply Admin

ZoomST:
Tankster:
Better titles for this article:

Pass the Hash

Hash Table Collision

Smokin' Hash

420 Error

'Re-Inventing the Alphabet'? ... c'mon guys!
As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

Joint WTF?

2013-05-01 Reply Admin

Let's see.

abcdefgijkmnopqrstwxyz1234567890 - that's 32 characters.

The filenames consist of 5 characters randomly chosen out of those 32. That means there are 32^5 = 33,554,432 possible filenames.

So, if there were tens of millions of files, the chance would be quite high (or even certain) that there must have been collisions.

Maybe, when there was a collision, the software would just overwrite an existing file with the same name?
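The arithmetic above can be reproduced in a few lines. This sketch assumes names are drawn uniformly at random from the 32-character alphabet and that a collision silently overwrites the older file. The function names are made up for illustration.

```python
import math

# The 32-character alphabet quoted above: a-z without h, l, u, v, plus the digits.
ALPHABET = "abcdefgijkmnopqrstwxyz1234567890"
NAME_LENGTH = 5


def possible_names(alphabet: str = ALPHABET, length: int = NAME_LENGTH) -> int:
    """Number of distinct names of the given length over the alphabet."""
    return len(set(alphabet)) ** length


def expected_overwrites(files: int, names: int) -> float:
    """Expected number of uploads that land on an already-used name."""
    # Expected distinct names after `files` uniform draws:
    #   names * (1 - (1 - 1/names) ** files)
    distinct = names * -math.expm1(files * math.log1p(-1.0 / names))
    return files - distinct


total = possible_names()
print(total)
print(round(expected_overwrites(10_000_000, total)))
```

Under those assumptions, ten million uploads would already be expected to clobber well over a million earlier files, and anything beyond 33,554,432 uploads must collide by the pigeonhole principle.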

2013-05-01 Reply Admin

ZoomST:
As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

That was probably intentional, to make it a nice round number of characters (32).

2013-05-01 Reply Admin

Jasper:
[..]
Maybe, when there was a collision, the software would just overwrite an existing file with the same name?

I'm guessing this is what happened.

With so many files, I could imagine any clients using the service would just "store and forget". Kind of like an archive in a bureaucratic environment.

"Could you get file x from 1963 from the archives, Greg?" < "Sure, be right back" (staff giggles) < 2 years later, Greg suddenly reappears "Greg, we thought you were fired or something, where did you go?" < "To get file X from the archives, it wasn't there anymore, instead I got file Y from 1974"

2013-05-01 Reply Admin

"99.9%"

So, once out of every 1000 it will fail?

DaveK · 2013-05-01 Reply Admin

CodeCow:
Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

He should have trusted hash!

2013-05-01 Reply Admin

No, if you generate 5.2e75 different files 1000 times, one will cause a collision.

2013-05-01 Reply Admin

Do not MD5 in the files of Draughons, for you are crunchy, and good with ketchup.

Woah man, I got the munchies sooooo bad...

operagost · 2013-05-01 Reply Admin

skotl:
Davor:
I wonder why he skipped h, l, u and v?

hluv I know...

If this doesn't become a featured comment, we'll know Alex is on vacation.

2013-05-01 Reply Admin

RFoxmich:
The problem, however, is that as good as MD5 is, you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating MD5 collisions, for example). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.

Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.

An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy each. So that's about 50-100 bits of entropy in total, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.
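A back-of-the-envelope version of that estimate, taking the 5-bits-per-character figure as a rough assumption and treating MD5 as a uniformly random 128-bit function for accidental (not adversarial) collisions. The helper names are invented for this sketch.

```python
BITS_PER_CHAR = 5   # rough per-character entropy assumed in the comment above
EXTENSION = ".mp3"  # fixed suffix, contributes no entropy


def name_entropy_bits(name_length: int) -> int:
    """Estimated entropy of a file name of the given total length, in bits."""
    return (name_length - len(EXTENSION)) * BITS_PER_CHAR


def accidental_collision_probability(names: int, hash_bits: int = 128) -> float:
    """Birthday approximation p ~= n(n-1) / (2 * 2^bits) for distinct inputs."""
    return names * (names - 1) / (2.0 * 2.0 ** hash_bits)


print(name_entropy_bits(15), name_entropy_bits(25))  # prints "55 105"
print(accidental_collision_probability(50_000_000))
```

Even with fifty million distinct names the accidental collision probability stays far below one in a billion billion. Identical names are a different matter, since they hash identically by definition.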

2013-05-01 Reply Admin

¯\(°_o)/¯ I DUNNO LOL:
RFoxmich:
The problem, however, is that as good as MD5 is, you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating MD5 collisions, for example). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.
Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.
An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy each. So that's about 50-100 bits of entropy in total, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.

You can use whatever hash algorithm you want, if you hash the same file name you get the same hash

2013-05-01 Reply Admin

Tankster:
ZoomST:
Tankster:
Better titles for this article:

Pass the Hash

Hash Table Collision

Smokin' Hash

420 Error

'Re-Inventing the Alphabet'? ... c'mon guys!
As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

Joint WTF?

Come on, "Rolling your own Hash" would still have been good.

2013-05-01 Reply Admin

Mike:
¯\(°_o)/¯ I DUNNO LOL:
RFoxmich:
The problem, however, is that as good as MD5 is, you can still have collisions on different names (see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for a paper that describes an algorithm for generating MD5 collisions, for example). So you've got to have some way to recover the original name...and MD5 doesn't give that to you.
Except that those are hashes on file data, and these are hashes on file names. The intentional collisions are done for entire files where you have thousands of bytes to work with.
An MD5 hash is 128 bits (16 bytes), and a typical MP3 file name may be in the 15-25 byte range. But four of those bytes (".mp3") have no entropy, and the rest have about 5 bits of entropy each. So that's about 50-100 bits of entropy in total, which means that MD5 hashes of MP3 file names are really unlikely to have accidental collisions. To make an intentional collision, you would likely have to use the full range of 8-bit characters.

You can use whatever hash algorithm you want, if you hash the same file name you get the same hash

Well, duh. If you want to overwrite a file, you have to end up with the same hash.

levbor · 2013-05-01 Reply Admin

It seems that he wanted to overwrite files with the same name (hmmm, what if they belong to different users?) So why not just use the original file name?

Also, it's not clear how this was supposed to work. A client uploads a file, then browses all existing file names? Or are they stored in a DB? How did this work originally?

2013-05-01 Reply Admin

Beyond unlikely. The birthday paradox says that though there are 2^25 (about 33.5 million) possible filenames, you'd expect, on average, to get the first collision at around file 2^13, i.e. after having uploaded fewer than 10k files.

After that, collisions get more and more likely. To get to "tens of millions of files" with zero collisions is astronomically unlikely.
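A quick way to sanity-check the birthday estimate, assuming uniformly random names over the 32^5 possibilities. The closed form sqrt(pi * N / 2) is the standard approximation for the expected number of draws before the first repeat, and the simulation is a hypothetical stand-in for the real upload function.

```python
import math
import random

NAMES = 32 ** 5  # 33,554,432 possible five-character names


def expected_first_collision(names: int) -> float:
    """Approximate expected number of uploads until the first repeated name."""
    return math.sqrt(math.pi * names / 2.0)


def simulate_first_collision(names: int, rng: random.Random) -> int:
    """Draw random names until one repeats; return how many draws that took."""
    seen = set()
    while True:
        name = rng.randrange(names)
        if name in seen:
            return len(seen) + 1
        seen.add(name)


print(round(expected_first_collision(NAMES)))
trials = [simulate_first_collision(NAMES, random.Random(seed)) for seed in range(20)]
print(sum(trials) / len(trials))
```

The closed form gives a little over seven thousand uploads, and the simulated average should land in the same neighbourhood, which is a long way short of tens of millions.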

2013-05-01 Reply Admin

Oh. I recognize the solution. The architect in my previous company did that. Guess who was escorted, though.

dynedain · 2013-05-01 Reply Admin

With 33.5 million possible filenames (32^5, since the alphabet has only 32 characters) generated by this function, and tens of millions of uploads, there's almost certainly a name collision.

Michael just hasn't found them yet.

flabdablet · 2013-05-01 Reply Admin

l is missing because it looks like I and 1; u is missing so that none of the files end up with "fuck" in their names; h and v are missing because... errr...

OK, I got nothing.

As for using a hash over the file contents instead: I believe Dropbox does this very thing, which is why if you upload a movie that somebody else has already stolen and uploaded before you, your own upload will complete impossibly quickly. It's a reasonably effective de-duplication scheme that has to be saving Dropbox a shitload of disk space, and is probably the main reason why they don't automatically encrypt your stuff client-side.
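To make the de-duplication idea concrete, here is a toy in-memory sketch. It is not a description of how Dropbox actually works; the `DedupStore` class and the choice of SHA-256 are assumptions for illustration. A second upload of identical bytes is recognised by its digest and nothing new gets stored.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """Toy content-addressed store: one copy per distinct byte string."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> Tuple[str, bool]:
        """Store data under its digest; return (key, was_new)."""
        key = hashlib.sha256(data).hexdigest()
        was_new = key not in self._blobs
        if was_new:
            self._blobs[key] = data
        return key, was_new

    def get(self, key: str) -> bytes:
        """Fetch previously stored data by digest."""
        return self._blobs[key]


store = DedupStore()
print(store.put(b"same movie bytes")[1])  # prints "True": first copy is stored
print(store.put(b"same movie bytes")[1])  # prints "False": duplicate is skipped
```

In a real service the client would presumably send the digest first so the upload itself can be skipped, which only works if the server can see a hash of the plaintext.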

Addendum (2013-05-01 11:31): On further thought: h is missing so none of the files end up with "shit" in their names, and v is missing because it looks too much like u.

2013-05-01 Reply Admin

random drug screenings

TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.

flabdablet · 2013-05-01 Reply Admin

Captcha: acsi:
random drug screenings
TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.

What are you talking about, you pinko commie asshole? There is no country in the world whose corporations have more freedom to oppress, exploit and enslave their workers than the USA.

USA! USA! USA!

2013-05-01 Reply Admin

Dear flabdablet, it took you 10 minutes to write that comment?

Salami · 2013-05-01 Reply Admin

Captcha: acsi:
random drug screenings
TRWTF right there. I wouldn't want to live in a country with so little freedom as the US.

In the USA, we see this as the freedom of the business owner to demand a drug test any time he wants.

2013-05-01 Reply Admin

Davor:
I wonder why he skipped h, l, u and v?

Because, including '0' through '9', that would make 36 characters. By eliminating four, he now has a nice round 32 characters.

2013-05-01 Reply Admin

ZoomST:
At first I called it fake, seeing the crapper being fired -- not likely to happen in real life. But then I realized that he was storing data files with random names (which, I should point out, is not hashing in any sense); suspicious files for some anti-virus systems... So I think he was fired for inserting a virus into the server. Now it makes sense to me.

Sorry, not possible, as 'v' and 'u' are never part of the filename. :-)

2013-05-01 Reply Admin

SHA:
Tankster:
ZoomST:
Tankster:
Better titles for this article:

Pass the Hash

Hash Table Collision

Smokin' Hash

420 Error

'Re-Inventing the Alphabet'? ... c'mon guys!
As pointed out by Davor at 407007, you missed TRWTF: the alphabet is incomplete.

Joint WTF?
Come on, "Rolling your own Hash" would still have been good.

You can't roll your own hash without the letter 'h'.

2013-05-01 Reply Admin

CodeCow:
Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?

I worked through some code that one of my pre-pre-pre-decessors at my current workplace wrote... boy, that's pure TDWTF material there. I will submit some of it in the future (one method has 45! nested if-else-statements...). But he wasn't fired for being "a shitty coder". In fact, his programs do work. They are slow as hell and can't be maintained (because nobody wants to dig through the 1000 lines of uncommented code that guy wrote for a task as simple as reading a CSV-file into a DataGrid). But since the programs do what they should do, nobody who could fire him knew that he was a shitty coder.

TL;DR: For Draughon to be fired for being "a shitty coder" you'd need someone in charge to KNOW that he's a "shitty coder". Very unlikely.

2013-05-01 Reply Admin

some pony:
Sure, if you have the same file multiple times, it will collide with itself. But does it really matter whether a hash refers to the first or second copy of exactly the same data?

Except that's not what would happen.

The hash is over the file name, not contents. It's totally possible that you could have two files (e.g. track01.mp3) with the same name and totally different contents, and those would collide.

2013-05-01 Reply Admin

CodeCow:
Ok... so the guy was fired not because he was a shitty coder, but because he smoked weed?
Considering he's not a pilot or something like that, that seems pretty fucked up

It mentions he had rolling papers in his desk. Circumstantial evidence, yes, but come on, why would you have the paraphernalia at work? So you could roll one when you get to the car where you have the actual weed? That guy's a disaster waiting to happen. Maybe he signed a drug-free workplace agreement, or more likely this was in a right-to-work state.

Captcha: esse esse marlboro officer, I swear man. I love you man.

FragFrog · 2013-05-01 Reply Admin

Kactus:
Because of the dangers of putting anything from the internet into db queries it was a requirement we hashed every request and did a lookup on the hash, and because of the danger of collisions we hashed to both md5 and sha and did the lookup on both hashes.

Wait, what? What could possibly have been the thought process behind that requirement? They do realize that pretty much every database-powered website in existence accepts content from the internet without hashing, right?

While I will not deny that some (okay, many) suffer from SQL injection vulnerabilities, when it concerns just a single field, I would think that even the person who wrote that requirement should have been able to find one of the dozen ways to safely insert internet data into a database.
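One of those ways is a bound (parameterized) query. A minimal sketch with Python's built-in `sqlite3` module follows; the `devices` table, its columns and the `lookup_config` helper are all hypothetical, since nothing about the real schema is given above.

```python
import sqlite3


def lookup_config(conn: sqlite3.Connection, mac_address: str):
    """Look up a device config by MAC address using a bound parameter."""
    row = conn.execute(
        "SELECT config FROM devices WHERE mac = ?", (mac_address,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical schema and data, in memory only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (mac TEXT PRIMARY KEY, config TEXT)")
conn.execute("INSERT INTO devices VALUES (?, ?)", ("00:11:22:33:44:55", "vlan=10"))

print(lookup_config(conn, "00:11:22:33:44:55"))  # prints "vlan=10"
print(lookup_config(conn, "' OR '1'='1"))        # prints "None"
```

The driver passes the value separately from the SQL text, so the injection attempt is treated as an ordinary (non-matching) string and no hashing of the request is needed.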

Hannes:
TL;DR: For Draughon to be fired for being "a shitty coder" you'd need someone in charge to KNOW that he's a "shitty coder". Very unlikely.

Wait, you never talk to your managers? They never ask why maintenance of some modules takes three times as long as work done by the other devs? I mean, it might not go all the way up the chain, but presumably someone was in charge of the development team; would they not find out eventually?

Heck, I have a few components that through years of change requests have become somewhat less clear than they could be, and I make sure my boss knows this when he asks for a quote on yet another change: until I get time allotted to clean it up, changes to these modules will take longer than similar changes in the rest of our code-base.

Re-Inventing the Alphabet