• theRealWTF (unregistered)

    Yea for the people posting solutions which open and read the files. How about using the file information to see if they are the same - you know, like last modified time, byte length, etc. - using stat, BY_HANDLE_FILE_INFORMATION, etc., depending on whether the two directories are on the same type of drive, such as NTFS. Can we have another WTF for all the comment-posting code :)
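    A minimal sketch of that metadata-first screen on a POSIX system, using stat() (the function name is illustrative, not from the article; on Windows you'd reach for GetFileInformationByHandle and BY_HANDLE_FILE_INFORMATION instead, as the comment says). A size/mtime match is only a heuristic, so a content check is still needed if you want certainty:

    #include <sys/stat.h>

    // Returns true if the two paths definitely differ going by metadata alone;
    // false means size and mtime both match, which is merely "probably the same".
    bool definitely_differ(const char* a, const char* b)
    {
        struct stat sa, sb;
        if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
            return true;                        // unreadable: treat as different
        if (sa.st_size != sb.st_size)
            return true;                        // different length => different content
        return sa.st_mtime != sb.st_mtime;      // same size and mtime: probably identical
    }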

  • maitreg (unregistered)

    Reminds me of one of the first things I did when I started a job. I proposed writing a database script that could automatically do a series of data conversions and imports that were currently being keyed-in manually by an employee for an entire day. When I first proposed this to "Pete"'s supervisor, the response was, "But what would Pete do all day??"

    She was serious.

  • SR (unregistered) in reply to BSDGuy
    BSDGuy:
    Svillen? Is that you?

    Different SR, different batch file :o)

  • (cs) in reply to Ozru
    Ozru:
    Or you could just run diff -r -q to get a list of filenames to copy.

    Then, depending on your OS and available libraries, you either use the library-provided function to copy files, or you shell out and run cp.

    If you're running "diff -r -q" anyway, why not use the shell? Just "cp -p $(diff -r -q $DIRA $DIRB) $DIRC".

    Writing an application to do this is insane. But scanning through these comments, it seems the exact same WTF would be repeated, time and time again.

    "Maybe we should make it round and put the axle in the middle?" Sheesh.

  • (cs)

    Reading a file in C block by block until you reach the end isn't that messed up, but the block size he uses doesn't make any sense. And yes, he uses a Shlemiel approach to building the result vector. But as said before, a much bigger block size would have taken care of the performance problem.

    It seems as if the author knew some programming basics and hacked and tweaked his way towards a tool that just did what he needed it to do. Probably just worked fine for a small set of input files.

    No way that he was hired as a C++ developer and had that job title for eight years.
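    For reference, a minimal sketch of the "much bigger block size" fix mentioned above - reading in 64 KB chunks and appending a whole block at a time (the names and buffer size here are illustrative, not taken from the article's code):

    #include <cstdio>
    #include <vector>

    std::vector<char> read_whole_file(const char* path)
    {
        std::vector<char> data;
        FILE* fp = std::fopen(path, "rb");
        if (!fp)
            return data;                                 // empty on error; real code would report it

        char buffer[64 * 1024];                          // one sizeable block instead of byte-at-a-time
        std::size_t n;
        while ((n = std::fread(buffer, 1, sizeof buffer, fp)) > 0)
            data.insert(data.end(), buffer, buffer + n); // append the whole block in one go

        std::fclose(fp);
        return data;
    }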

  • (cs) in reply to Bim Job
    Bim Job:
    Not really fair to misquote fennec, is it?
    Are you blind or do you just fail at understanding quoting conventions? That was a completely verbatim quote from post #290251. Anyone who cares can check for themselves and see that you're being an idiot:

    OP: http://thedailywtf.com/Comments/Reading-Comprehension.aspx?pg=2#290251 My completely verbatim quote: http://thedailywtf.com/Comments/Reading-Comprehension.aspx?pg=2#290261

    Given that I didn't misquote anyone, I don't have any idea what your beef is, apart from the fact that I must have vaguely alluded to some political idea that you don't like and you're getting all emotional and angry. Was steam coming out of your ears when you posted? That's always a sign to slow up and think before hitting send.

    [ This space intentionally left blank for you to invent some weird new meaning for the verb "to misquote" that somehow could be seen as applying to anything I've done. ]

  • Franz Kafka (unregistered) in reply to emanningis
    emanningis:
    Reading a file in C block by block until you reach the end isn't that messed up, but the block size he uses doesn't make any sense. And yes, he uses a Shlemiel approach to building the result vector. But as said before, a much bigger block size would have taken care of the performance problem.

    If all you want is whether 2 files differ, then compute 2 checksums and compare them.

  • Not my real name (unregistered) in reply to Bellinghman

    What's one more level of indirection between friends?

    This isn't really that bad. I remember a C coder with a really bad obfuscation habit. Rather than creating normal local variables (as any sane person would do), he would use a general-purpose variable array. Priceless.

  • Not my real name (unregistered) in reply to Hugh Brown

    It returns the number of bytes read but it updates the pResult pointer arg (which can then be used to access the file contents).
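    In other words, something shaped roughly like this hypothetical sketch (the name ReadFileToBuffer and the malloc'd-buffer convention are guesses at the interface being described, not the article's actual code):

    #include <cstdio>
    #include <cstdlib>

    // Returns the number of bytes read (-1 on error) and points *pResult at a
    // malloc'd buffer holding the file contents; the caller must free() it.
    long ReadFileToBuffer(const char* path, char** pResult)
    {
        FILE* fp = std::fopen(path, "rb");
        if (!fp) return -1;
        std::fseek(fp, 0, SEEK_END);
        long size = std::ftell(fp);
        if (size < 0) { std::fclose(fp); return -1; }
        std::rewind(fp);
        *pResult = static_cast<char*>(std::malloc(size > 0 ? size : 1));
        if (!*pResult) { std::fclose(fp); return -1; }
        long nread = static_cast<long>(std::fread(*pResult, 1, static_cast<std::size_t>(size), fp));
        std::fclose(fp);
        return nread;
    }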

  • (cs) in reply to Franz Kafka

    I wouldn't trust two identical checksums that consist of (say) 128 bits telling me that the contents of two 5MB files are exactly the same.

    It only works one way: if the checksums differ, then so do the files. Identical checksums don't tell you all that much.

  • (cs) in reply to emanningis
    emanningis:
    I wouldn't trust two identical checksums that consist of (say) 128 bits telling me that the contents of two 5MB files are exactly the same.

    It only works one way: if the checksums differ, then so do the files. Identical checksums don't tell you all that much.

    But the differing checksums are useful for telling you not to bother going to the next stage:

    1: For all files, find their lengths. This should be an O(1) operation: you don't need to open them.

    2: For each file length that is shared by more than one file, read each of those files and compute its checksum.

    3: If you find any checksums for a particular file length that apply to more than one file, then you do the actual file comparison to see if they are byte-for-byte the same. The chances are pretty low that you'll find a difference that the checksum didn't, and it's possibly more likely that a spontaneous disc error happened, but at least you're only having to compare pairs of files that are the same length.

    Oh, and since this is a sanity check more than anything else, you can make use of the fact that if (a == b) and (c == d), then if (a == c) you've checked all 4 are equal to each other. So you end up doing N-1 comparisons to prove N files are all the same: you don't have to do all combinations.
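    A minimal C++17 sketch of stages 1 and 2 (std::filesystem gives the size without opening the file; a cheap FNV-1a hash stands in for whatever real checksum you'd use - MD5, SHA, etc. - and stage 3, the byte-for-byte comparison within each surviving group, is left out for brevity):

    #include <cstdint>
    #include <cstdio>
    #include <filesystem>
    #include <map>
    #include <utility>
    #include <vector>

    namespace fs = std::filesystem;

    // Stand-in 64-bit FNV-1a checksum over a file's contents.
    std::uint64_t checksum_of(const fs::path& p)
    {
        std::uint64_t h = 0xcbf29ce484222325ULL;
        FILE* fp = std::fopen(p.string().c_str(), "rb");
        if (!fp) return h;
        char buf[64 * 1024];
        std::size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, fp)) > 0)
            for (std::size_t i = 0; i < n; ++i)
                h = (h ^ static_cast<unsigned char>(buf[i])) * 0x100000001b3ULL;
        std::fclose(fp);
        return h;
    }

    // Stage 1: bucket by size (no file is opened). Stage 2: within each size
    // bucket holding two or more files, bucket again by checksum (each file is
    // read exactly once). Whatever still shares a bucket goes on to stage 3.
    std::map<std::pair<std::uintmax_t, std::uint64_t>, std::vector<fs::path>>
    group_candidates(const std::vector<fs::path>& files)
    {
        std::map<std::uintmax_t, std::vector<fs::path>> by_size;
        for (const auto& f : files)
            by_size[fs::file_size(f)].push_back(f);      // metadata lookup only

        std::map<std::pair<std::uintmax_t, std::uint64_t>, std::vector<fs::path>> by_sum;
        for (const auto& [size, group] : by_size)
            if (group.size() > 1)
                for (const auto& f : group)
                    by_sum[{size, checksum_of(f)}].push_back(f);
        return by_sum;
    }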

  • Design Pattern (unregistered) in reply to Bellinghman

    TRWTF is that insane nonsense of "computing checksums" to compare files! Anyone ever noticed that this is not a no-cost operation?

    Bellinghman:
    The chances are pretty low that you'll find a difference that the checksum didn't, and it's possibly more likely that a spontaneous disc error happened, but at least you're only having to compare pairs of files that are the same length.
    After having read both files to compute the checksum!

    Which means some of the files are read twice while the vanilla compare would only read every file once!

    Bellinghman:
    Oh, and since this is a sanity check more than anything else, you can make use of the fact that if (a == b) and (c == d), then if (a == c) you've checked all 4 are equal to each other. So you end up doing N-1 comparisons to prove N files are all the same: you don't have to do all combinations.
    But this is not the requirement here: For checking if the contents of two directories (and their subdirectories) are identical, only pairs of files will be compared against each other.

    There might be advantages to the checksum approach if you are allowed to store the checksums and can re-use them on later comparisons, but if that is not the case, the checksum approach is just a stupid WTF.

  • dude (unregistered) in reply to galgorah

    It's always laziness or dumbness, neither of which is very good to say to your boss or to yourself.

  • (cs) in reply to Design Pattern
    Design Pattern:
    TRWTF is that insane nonsense of "computing checksums" to compare files! Anyone ever noticed that this is not a no-cost operation?
    Bellinghman:
    The chances are pretty low that you'll find a difference that the checksum didn't, and it's possibly more likely that a spontaneous disc error happened, but at least you're only having to compare pairs of files that are the same length.
    After having read both files to compute the checksum!

    Which means some of the files are read twice while the vanilla compare would only read every file once!

    Insane nonsense? Actually no, probably not.

    Let's assume you have a bunch of files, all the same length. Which ones are the same?

    OK, we'll compare file 1 with file 2. Now you know whether those two files are the same, and if they are, great. But for now, let's assume they're not.

    Now we have a third file. Is it the same as either of the first two? Well, that's easy, you just compare it with file 1, and with file 2 (unless 1 and 2 were the same, of course).

    So, three files, and you've already read each file twice.

    Now add file 4. Oh noes, we have to compare it with each of the first three!

    See, the purpose of the checksum is to postpone the combinatorial explosion as far as possible.

    (Yes, you can certainly optimise the situation for small numbers of files - as you note, if there are only two files, then just compare them directly. But, like Bubble Sort, it scales very badly.)
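    To put rough, worst-case numbers on that (illustrative figures, not from the thread): with ten 5 MB files that all happen to share the same length,

        pairwise comparison: 10 × 9 / 2 = 45 pairs, each reading two 5 MB files, so up to ~450 MB of I/O;
        checksum first: 10 files read once, ~50 MB of I/O, then 45 cheap in-memory checksum comparisons.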

  • (cs) in reply to emanningis
    emanningis:
    I wouldn't trust two identical checksums that consist of (say) 128 bits telling me that the contents of two 5MB files are exactly the same.

    How about a SHA512 hash? Would that be enough for you? Man, you can settle for MD5SUM. To be honest, for many practical purposes a plain ol' CRC would do for small files. Yes, a CRC is not cryptographically secure, but no, it is still extremely unlikely that two different and yet nearly identical files will compute the same CRC.

    The approach used in most batch-run file comparison systems is as follows:

    if file1.size <> file2.size then
        different = true
    else if crc(file1) <> crc(file2) then
        different = true
    else
        different = false

    On average, your files will be drastically different. The off chance that two files picked at random that happen to be different will compute the same checksum is negligible to the point that it can be safely ignored.

    Unless you are working with, I dunno, real-time constraints (I mean true real-time, not some personal benchmarking), medical prescriptions, or some other life-and-death decision-making computing job, a checksum for this is just fine.

    emanningis:
    It only works one way: if the checksums differ, then so do the files. Identical checksums don't tell you all that much.

    They don't tell you that much. The context and probability of such an occurrence does tell you a lot. There is a tool called a ham-carving knife, and there is a tool called a scalpel. You don't use one for a problem that requires the other, do you?

  • (cs) in reply to Design Pattern
    Design Pattern:
    There might be advantages to the checksum approach if you are allowed to store the checksums and can re-use them on later comparisons, but if that is not the case, the checksum approach is just a stupid WTF.

    Dude, I don't get you. The goal is to compare pairs of files with the same name but in two different directories to see whether they are different or not. You don't need to store the checksums (???)

    if chksum(file1) <> chksum(file2) ????

    I don't quite follow this. Mass comparison of directories containing gigabytes of content has been done with chksum (or more precisely using sha512 or md5sum).

    Unless I had a specific and valid requirement for speed, I would have used a scripting approach leveraging some concoction of perl and sha512 or md5sum, and have it run in the background as a cron job, producing status reports at given intervals.

    That's an approach that has been used all the time, for the longest time, and it's been working well, in an economically sound and maintainable fashion (so long as you do not have a mandate to get this crap to run over 100Ks of files in real time as fast as possible).

  • C (unregistered) in reply to luis.espinal
    luis.espinal:
    I don't quite follow this.
    Perhaps because you haven't followed the whole thread? "Design Pattern" was answering Bellinghman's suggestion.
    luis.espinal:
    Mass comparison of directories containing gigabytes of content has been done with chksum (or more precisely using sha512 or md5sum).
    That would indeed be "just a stupid WTF". If the tool says it compares directories, that's what it should do. Not rely on some statistical improbability of collision, and/or read both colliding files twice! (Unless you're talking about comparisons between 3+ directory trees, in which case storage should indeed be involved.)
  • (cs) in reply to C
    C:
    luis.espinal:
    I don't quite follow this.
    Perhaps because you haven't followed the whole thread? "Design Pattern" was answering Bellinghman's suggestion.
    luis.espinal:
    Mass comparison of directories containing gigabytes of content has been done with chksum (or more precisely using sha512 or md5sum).
    That would indeed be "just a stupid WTF". If the tool says it compares directories, that's what it should do. Not rely on some statistical improbability of collision, and/or read both colliding files twice! (Unless you're talking about comparisons between 3+ directory trees, in which case storage should indeed be involved.)

    No. I'm following the thread. I'm referring to the insistence that relying on a sufficiently satisfactory improbability of collision is just good enough for a mass directory comparison. That, or simply running diff from a script.

    Why rely on a sufficiently satisfactory improbability of collision? Simple. Economics. It is cheaper to implement a script like that (with md5sum, or simply diff if you want to be that anal) than to write the whole damned enchilada in C++.

    This is my experience, my anecdotal evidence, so take it with a bit of salt or whatever condiment of your predilection - I've run, and I've seen run, these types of jobs for years on 10Ks of files (sometimes up in the hundreds of GBs) of content without ever seeing a collision between two different files of the same size. Different types of data: spreadsheets, database dumps, binaries going into firmware, you name it.

    I will grant you that it is possible, however improbable, that there was indeed a collision between two different files of the same size. But had it occurred, chances are it would have been found (and reported) down the pipeline. This, I can guarantee you.

    I don't need to pull a full-time programmer aside to write and maintain that task, distracting him from what he needs to code. I can simply delegate it to the sysadmin to perform as a sysadmin task, for example. The ease with which such a script can be developed and tested, the fact that the probability of such collisions is negligible, and business contexts where the cost of an improbable collision is itself negligible (that is, a bug of no consequence) all make for an economically justifiable solution.

    So if you want to say the solution is a WTF in the general case because of a highly improbable collision, independently of context and requirements (which, whether you like it or not, are quite forgiving in the general case because of the improbability of a collision), then be my guest.

  • (cs) in reply to luis.espinal
    luis.espinal:
    So if you want to say the solution is a WTF in the general case because of a highly improbable collision, independently of context and requirements (which, whether you like it or not, are quite forgiving in the general case because of the improbability of a collision), then be my guest.
    Indeed. Given a simple 128-bit checksum, that's enough different checksums for 10^38 files.

    That's a large number of files, more files by many orders of magnitude than the current global storage capacity as measured in bytes. The chances of any two files that actually differ managing to resolve to the same checksum by accident are statistically negligible. It's vastly more likely that you'll get a read error instead.

    However, let's go for paranoia: let's assume some cracker is trying to infiltrate a file and can actually get its checksum to match that of one of your existing files, and let's follow Design Pattern's insistence on actually comparing files to see if they're different. It's still vastly more efficient to (temporarily) note the checksums of all the files of a particular size - because that process requires reading each file only once - and not bother comparing files that the checksum process has already indicated as being different.
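    As a rough sanity check on "statistically negligible" (a back-of-the-envelope birthday-bound estimate, not a figure from the thread): with N files and a well-behaved 128-bit checksum, the chance that any two distinct files collide is roughly

        N^2 / (2 × 2^128)

    so even at N = 10^9 files that works out to about (10^18 / 2) / (3.4 × 10^38) ≈ 1.5 × 10^-21, which is dwarfed by realistic rates of undetected disc read errors.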

  • (cs)

    so Rik V never had the chance to realise this for eight years?

  • Matt (unregistered)

    vector<char> filedata;
    FILE* fp = fopen("file.bin", "rb");
    while (!feof(fp)) {
        filedata.push_back(fgetc(fp));
    }

    "but sir! its so neat, and uses stl and everything!!"
