The Daily WTF: Curious Perversions in Information Technology

2023-02-16 Reply Admin

Everyone knows you should check the file extension first.

2023-02-16 Reply Admin

Whatever happened to the good old days of reading the entire file contents, then one-by-one checking if you can parse as JSON, CSV, XML, and Base64?

2023-02-16 Reply Admin

. . . and if there's no file involved? The code presented shows no such dependency, merely a test of a stream of characters.

Rick · 2023-02-16 Reply Admin

This makes me want to do a pull request on the source for the Linux "file" command. Currently it requires the <?xml tag to recognize xml.

MaxiTB · 2023-02-16 Reply Admin

Text files can still have so many encodings; any detection like that is pretty much as wild a guess as file extensions are.

Melissa U · 2023-02-16 Reply Admin

This is not necessarily a WTF. We don't know where the possible streams of data are coming from. Maybe it's already clear that it's some kind of structured and condensed data, either JSON, XML, or Linux property files; then it's totally okay to rely on the first character.

MaxiTB · 2023-02-16 Reply Admin

Again, encoding :-)

2023-02-16 Reply Admin

"Any string of text which starts with < is clearly an XML file."

I don't get it. ALL of my php files start with a "<", i.e. "<?php"

2023-02-16 Reply Admin

This could be acceptable if the input's already restricted. Say you're receiving inputs from another system that might emit data in any of a handful of different formats. If among those formats XML is the only one that starts with 0x3C, this seems like a reasonable and cheap test to decide whether to ask the parser to have a go at it.

A comment to that effect would be nice though, in case one day the upstream system is extended to emit a conflicting format. Even if at that point you need a more sophisticated test to differentiate it from XML, it doesn't mean the straightforward approach was wrong initially.
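
A minimal sketch of that kind of cheap dispatch, assuming the upstream system only ever emits XML, JSON, or key=value property lines. The name `guess_format` and the list of formats are made up for illustration and are not taken from the article's code; the docstring is the sort of note suggested above.

```python
def guess_format(text: str) -> str:
    """Guess the format of `text` from its first non-whitespace character.

    ASSUMPTION: upstream only sends XML, JSON objects/arrays, or key=value
    property lines. If it ever starts emitting something else that begins
    with '<' (HTML, PHP, ...), this test is no longer sufficient.
    """
    head = text.lstrip()[:1]      # empty string if there is no content
    if head == "<":
        return "xml"
    if head in ("{", "["):
        return "json"
    return "properties"
```

The result only decides which parser gets the first try; the parser still has the final say on whether the input is valid.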

2023-02-16 Reply Admin

"they at least trimmed whitespace off"

Non-python coder detected. ;)

dkf · 2023-02-16 Reply Admin

Currently it requires the <?xml tag to recognize xml.

Not all XML files start with that. They probably should, but they don't, especially when they've got to be fed through someone's crufty home-rolled XML parser (that totally isn't just a bunch of regexps).

2023-02-16 Reply Admin

then one-by-one checking if you can parse as JSON, CSV, XML, and Base64

Base64 is an encoding, not a format, so wouldn't you have to decode the Base64 and then try to parse the format? :P
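
Roughly like this, as a sketch that assumes the wrapped format is JSON. The helper name `parse_base64_json` is hypothetical; only standard-library calls are used.

```python
import base64
import binascii
import json


def parse_base64_json(payload: str):
    """Decode Base64 first, then try to parse the decoded bytes as JSON.

    Returns the parsed value, or None if either step fails.
    """
    try:
        # validate=True rejects characters outside the Base64 alphabet
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None
    try:
        return json.loads(raw)    # json.loads accepts bytes directly
    except ValueError:            # covers JSONDecodeError and decode errors
        return None
```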

2023-02-16 Reply Admin

It's a minor wtf to call a variable str, as it's the Python string type. Or is Python TRWTF for using simple names like str, id, and letting people redefine them?
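
To illustrate the shadowing, here is a sketch with made-up function names (not the article's code). Python accepts the first version without complaint, which is rather the point.

```python
def first_char_is_angle(str):            # parameter named like the built-in
    # Within this function the name `str` refers to the argument, so the
    # built-in type is unreachable here: str(42) would raise
    # "TypeError: 'str' object is not callable".
    return str.strip().startswith("<")


def first_char_is_angle_renamed(text):   # same check, built-in untouched
    return text.strip().startswith("<")
```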

2023-02-16 Reply Admin

Well, the file could always start with a BOM...

2023-02-16 Reply Admin

I'm pretty sure that an XML parser would do just as good a job in just as fast a time, since it actually knows how an XML file is structured. Worst case, I suppose, is that it would read the entire file in before parsing it, but even that is unlikely to be a performance killer. Best case is, you actually get to use a parser that understands the file format ... which would include BOMs and so on.
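
A minimal sketch of "just ask the parser", using Python's standard `xml.etree.ElementTree`. The wrapper name `is_well_formed_xml` is invented for the example, and the whole input is assumed to fit in memory.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data) -> bool:
    """Return True if the parser accepts `data` (str or bytes) as XML."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True
```

Passing bytes rather than an already-decoded string lets the parser deal with the BOM and any encoding declaration itself.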

2023-02-16 Reply Admin

Reading an entire file just to find out its type is a really terrible idea. In the vast majority of cases, you can get a good enough answer by examining a fixed initial part, which is O(1) rather than the O(N) of reading it all.

And to answer the original question, all XML documents must start with '<' (apart from whitespace). The specification says that a document starts with 'prolog', which starts with optional XMLDecl ("<?xml", so starts with '<'), followed by optional 'Misc', which is a "Comment" ("<!--"...), "PI" ("<?"...), or "S" (whitespace); and then an optional "doctypedecl" (which is "<!DOCTYPE"...) before the root element (which is "<" plus a name), so the first non-whitespace must always be "<".
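
A sketch of the fixed-prefix check, assuming a byte stream in an ASCII-compatible encoding with no BOM. The name `looks_like_xml` and the 64-byte limit are arbitrary choices for illustration.

```python
import io

# The four characters allowed by the XML spec's "S" (whitespace) production.
XML_WHITESPACE = b" \t\r\n"


def looks_like_xml(stream, limit: int = 64) -> bool:
    """Read at most `limit` bytes and test the first non-whitespace byte.

    A True result only means "worth handing to an XML parser"; HTML, PHP
    and other '<'-prefixed formats pass this test too.
    """
    head = stream.read(limit)
    return head.lstrip(XML_WHITESPACE).startswith(b"<")


# Usage with in-memory streams standing in for real files:
xml_guess = looks_like_xml(io.BytesIO(b"\n  <root/>"))
json_guess = looks_like_xml(io.BytesIO(b'{"a": 1}'))
```

The bytes read are consumed, so a caller would need to seek back or keep the prefix before parsing.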

2023-02-16 Reply Admin

And then good luck guessing whether it's ASCII, Latin-1, UTF-8, UTF-16 etc before you even start thinking about what "<" is. :) (Hint: UTF files can have BOM prefixes.)
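
For the BOM part at least, a sketch using the constants in Python's `codecs` module. The function name `decode_with_bom` and the UTF-8 fallback are assumptions, and input without a BOM is still a guess.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode `data`, using a leading BOM to pick the codec if one is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order.
            return data.decode(encoding)
    return data.decode(fallback)   # no BOM: the fallback is only a guess
```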

LorenPechtel · 2023-02-16 Reply Admin

Count me amongst those who don't see a problem here without knowing the context. More than once I've done parsing where the input is one of a limited variety of possible things; any attribute that's unique to only one option is perfectly adequate in such a situation. Yeah, it puked on the .xps that was really a scanned image of the form rather than the computer-generated form, but so what? It was a fatal error no matter what; a more sophisticated parser would simply have been a bit more graceful about how it squawked.

2023-02-16 Reply Admin

that's the joke.gif

2023-02-17 Reply Admin

I was thinking the same way. As we do not know where the input comes from, this could be the cheapest way to verify that the input really is XML and not some other (proprietary) format.

Sometimes what seems like a WTF is really OK in context; judging without context should not be done.

2023-02-17 Reply Admin

Looks very ad-hoc but for all we know it just might be Good Enough.

2023-02-20 Reply Admin

Your sarcasm detector is broken.

2023-02-26 Reply Admin

SGML (other than XML). I still have a lot of SGML files around. HTML. HTML might or might not be well formed XML but probably isn't. RDF in N-Triples or N-Quads format. Textual .plist files: a binary datum is written between angle brackets and a plist file could hold just a binary datum. All of these could begin with a "<" character.

Then of course, just because something is represented in XML, that's no guarantee that it's XML you can make any sense of.

2023-02-26 Reply Admin

Here’s what you are supposed to do on iOS: you call a method that returns the mimetype. Doing that ensures that every single application agrees. So it won’t happen that one app thinks something is an image and another thinks it is executable code.

And then if it has the right MIME type, you throw it into an XML parser, and it will either get parsed or not.

2023-02-26 Reply Admin

If it’s ASCII then there should be a line like encoding=ascii, encoded in ASCII. That means nobody must ever create an encoding named ASCJJ that has the code points for I and J swapped :-)

yaytay · 2023-03-18 Reply Admin

The Unicode has already been parsed (because it's a string), so it can't (be valid and) start with a BOM.
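
In Python this depends on which codec did the decoding, so it may be worth checking. A small sketch, assuming the bytes were UTF-8 with a BOM (the sample data is made up):

```python
raw = b"\xef\xbb\xbf<a/>"          # UTF-8 BOM followed by a tiny document

plain = raw.decode("utf-8")        # the plain codec keeps the BOM as U+FEFF
cleaned = raw.decode("utf-8-sig")  # the -sig variant drops it

# str.strip() does not treat U+FEFF as whitespace, so a first-character
# test on `plain` would not see the '<'.
plain_passes = plain.strip().startswith("<")
cleaned_passes = cleaned.strip().startswith("<")
```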

File Type Detection
