File Type Detection

Remy Porter

Computers were a mistake, so I accidentally became a farmer? Editor-in-Chief for TDWTF.

Discerning the type of data stored in a file is frequently a challenge. We've come up with all sorts of ways to do it: including magic bytes at the start of a file, using file extensions, appending MIME type information where possible, and often just hoping for the best. Ivan was working on a Python system that needed to handle XML data. Someone wanted to make sure that the XML data was actually XML, and not some other file format.

def is_xml(str):
    return str.startswith("<")

Any string of text which starts with < is clearly an XML file. This certainly won't give any false positives. If we assume that they at least trimmed whitespace off, I think we can be fairly confident that there won't be any false negatives. Though if there is some way to generate a valid XML document where the first non-whitespace character isn't a <, I'd be curious to see it.
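To make the false positives concrete, here's a minimal sketch that puts the check next to one that actually tries to parse the text. The parameter is renamed to `text` so it doesn't shadow the built-in `str`. The `parses_as_xml` helper and the sample strings are my own assumptions for illustration, not anything from Ivan's codebase, and parsing with the standard library's `xml.etree.ElementTree` is just one plausible way to do it.

```python
import xml.etree.ElementTree as ET


def is_xml(text):
    # The check from the article, with the parameter renamed from `str`.
    return text.startswith("<")


def parses_as_xml(text):
    """Hypothetical stricter check: True only if the text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Made-up sample inputs: two that merely start with "<", one real document.
samples = [
    "<3 this is not XML",
    "<p>unclosed paragraph",
    "<root><item/></root>",
]

for sample in samples:
    print(repr(sample), is_xml(sample), parses_as_xml(sample))
```

All three samples pass `is_xml`, but only the last one survives a real parser. Parsing costs more than peeking at one character, of course, so whether it's worth it depends on how much you trust whoever is handing you files.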

The real question is: what if this check is actually successful at filtering out a large number of invalid files? If this check is basically useless, that's a WTF. If this check is actually valuable, that's a bigger WTF.