Discerning the type of data stored in a file is frequently a challenge. We've come up with all sorts of ways to do it, like including magic bytes at the start of a file, using file extensions, attaching MIME type information where possible, and often just hoping for the best. Ivan was working on a Python system that needed to handle XML data. Someone wanted to make sure that the XML data was actually XML, and not some other file format.
def is_xml(str):
    return str.startswith("<")
Any string of text which starts with < is clearly an XML file. This certainly won't give any false positives. If we assume that they at least trimmed whitespace off, I think we can be fairly safe that there won't be any false negatives at least. Though if there is some way to generate a valid XML document where the first non-whitespace character isn't a <, I'd be curious to see it.
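For comparison, here's a minimal sketch of what asking an actual parser might look like. This isn't anything from Ivan's codebase: the names `is_xml_prefix` and `is_well_formed_xml` are made up for illustration, and I'm assuming the input is already a decoded Python string and that the standard library's `xml.etree.ElementTree` is an acceptable parser for the job.

```python
import xml.etree.ElementTree as ET


def is_xml_prefix(text):
    """The prefix check from above, generously assuming whitespace is trimmed."""
    return text.strip().startswith("<")


def is_well_formed_xml(text):
    """Hypothetical alternative: let a real parser decide if the text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# A few illustrative inputs (invented for this sketch, not real data).
samples = [
    "<root><child/></root>",  # well-formed XML
    "<html><p>unclosed",      # starts with "<" but never closes its tags
    "< not xml at all",       # starts with "<" and is nothing like XML
    "plain text",             # doesn't even start with "<"
]

for sample in samples:
    print(repr(sample), is_xml_prefix(sample), is_well_formed_xml(sample))
```

The middle two samples sail through the prefix check and get rejected by the parser, which is the false-positive problem in miniature. The trade-off is that parsing costs more than peeking at one character, and the Python documentation warns that the standard `xml` modules are not secure against maliciously constructed data, so untrusted input may call for something more hardened.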
The real question is: what if this check is actually successful at filtering out a large number of invalid files? If this check is basically useless, that's a WTF. If this check is actually valuable, that's a bigger WTF.