• Industrial Automation Engineer (unregistered)

    Everyone knows you should check the file extension first.

  • Allie C (unregistered)

    Whatever happened to the good old days of reading the entire file contents, then one-by-one checking if you can parse as JSON, CSV, XML, and Base64?

  • dpm (unregistered) in reply to Industrial Automation Engineer

    . . . and if there's no file involved? The code presented shows no such dependency, merely a test of a stream of characters.

  • (nodebb)

    This makes me want to do a pull request on the source for the Linux "file" command. Currently it requires the <?xml tag to recognize xml.

  • (nodebb) in reply to Rick

    Text files can still have so many encodings that any detection like that is pretty much as wild a guess as file extensions are.

  • (nodebb)

    This is not necessarily a WTF. We don't know where the possible streams of data are coming from. Maybe it's already clear that it's some kind of structured and condensed data, either JSON, XML, or Linux property files; then it's totally okay to rely on the first character.

  • (nodebb) in reply to Melissa U

    Again, encoding :-)

  • Dick Yates (unregistered)

    "Any string of text which starts with < is clearly an XML file."

    I don't get it. ALL of my php files start with a "<", i.e. "<?php"

  • Vicki (unregistered)

    This could be acceptable if the input's already restricted. Say you're receiving inputs from another system that might emit data in any of a handful of different formats. If among those formats XML is the only one that starts with 0x3C, this seems like a reasonable and cheap test to decide whether to ask the parser to have a go at it.

    A comment to that effect would be nice though, in case one day the upstream system is extended to emit a conflicting format. Even if at that point you need a more sophisticated test to differentiate it from XML, it doesn't mean the straightforward approach was wrong initially.
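
    A minimal sketch of that idea, assuming the upstream system emits only XML or JSON (that format list, and the name parse_payload, are illustrative assumptions, not something the article's code shows). The first character only picks which parser gets a go; the parser still decides whether the input is valid:

```python
import json
import xml.etree.ElementTree as ET


def parse_payload(text):
    """Dispatch on the first non-whitespace character, then let a real parser decide.

    Assumes the upstream formats are restricted to XML and JSON. If a new
    format starting with '<', '{' or '[' is ever added upstream, this test
    needs to become more sophisticated.
    """
    body = text.lstrip()
    head = body[:1]
    if head == "<":
        # Raises xml.etree.ElementTree.ParseError on malformed XML.
        return "xml", ET.fromstring(body)
    if head in ("{", "["):
        # Raises json.JSONDecodeError on malformed JSON.
        return "json", json.loads(body)
    raise ValueError("unrecognised payload format")
```

    Anything that merely starts with "<" (a PHP file, say) still fails in the parser rather than being accepted as XML.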

  • GvR (unregistered)

    "they at least trimmed whitespace off"

    Non-python coder detected. ;)

  • (nodebb) in reply to Rick

    Currently it requires the <?xml tag to recognize xml.

    Not all XML files start with that. They probably should, but they don't, especially when they've got to be fed through someone's crufty home-rolled XML parser (that totally isn't just a bunch of regexps).

  • ZZartin (unregistered)

    then one-by-one checking if you can parse as JSON, CSV, XML, and Base64

    Base64 is an encoding, not a format, so wouldn't you have to decode the Base64 and then try to parse the format? :P

  • fa (unregistered)

    It's a minor wtf to call a variable str, as it's the Python string type. Or is Python TRWTF for using simple names like str, id, and letting people redefine them?

  • Vilx- (unregistered)

    Well, the file could always start with a BOM...

  • Sole Purpose Of Visit (unregistered) in reply to Vicki

    I'm pretty sure that an XML parser would do just as good a job in just as fast a time, since it actually knows how an XML file is structured. Worst case, I suppose, is that it would read the entire file in before parsing it, but even that is unlikely to be a performance killer. Best case is, you actually get to use a parser that understands the file format ... which would include BOMs and so on.

  • Charles (unregistered) in reply to Sole Purpose Of Visit

    Reading an entire file just to find out its type is a really terrible idea. In the vast majority of cases, you can get a good enough answer by examining a fixed initial part, which is O(1) rather than O(N) of reading it all.

    And to answer the original question, all XML documents must start with '<' (apart from whitespace). The specification says that a document starts with 'prolog', which starts with optional XMLDecl ("<?xml", so starts with '<'), followed by optional 'Misc', which is a "Comment" ("<!--"...), "PI" ("<?"...), or "S" (whitespace); and then a "doctypedecl" (which is "<!DOCTYPE"...), so the first non-whitespace must always be "<".
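
    As a rough sketch of the fixed-prefix check (the function name and the 64-character probe size are arbitrary assumptions), reading only the head of an already-decoded text stream:

```python
import io

# The four characters allowed by the XML spec's "S" (whitespace) production.
XML_WHITESPACE = " \t\r\n"


def stream_starts_like_xml(stream, probe_size=64):
    """Return True if the first non-whitespace character in the probe is '<'.

    Reads at most probe_size characters, so the cost does not grow with the
    size of the input. This is a necessary condition for XML, not a
    sufficient one, and input with more leading whitespace than the probe
    size is reported as not XML.
    """
    head = stream.read(probe_size)
    return head.lstrip(XML_WHITESPACE).startswith("<")


# Usage with in-memory streams:
xml_like = stream_starts_like_xml(io.StringIO("\n  <root/>"))
csv_like = stream_starts_like_xml(io.StringIO("a,b,c"))
```

    Note that this consumes the probed characters, so a real caller would need to keep the head around or use a seekable stream.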

  • Randal L. Schwartz (google)

    And then good luck guessing whether it's ASCII, Latin-1, UTF-8, UTF-16 etc before you even start thinking about what "<" is. :) (Hint: UTF files can have BOM prefixes.)
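
    For the BOM part at least, a hedged sketch using the constants in Python's codecs module (decode_with_bom and the UTF-8 fallback are assumptions for illustration; without a BOM it really is still guesswork):

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the BOM if one is present, else a default encoding.

    The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves,
    so the returned string does not start with U+FEFF.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    return data.decode(default)
```

    Only after this step does checking for "<" mean anything.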

  • (nodebb)

    Count me amongst those who don't see a problem here without knowing the context--more than once I've done parsing where the input is one of a limited variety of possible things. Any attribute that's unique to only one option is perfectly adequate in such a situation. Yeah, it puked on the .xps that was really a scanned image of the form rather than the computer-generated form, but so what? It was a fatal error no matter what; a more sophisticated parser would simply have been a bit more graceful about how it squawked.

  • konnichimade (unregistered) in reply to Dick Yates
    Comment held for moderation.
  • ismo (unregistered) in reply to Melissa U

    I was thinking the same way. As we do not know where the input comes from, this could be the cheapest way to verify that the input really is XML and not some other (proprietary) format.

    Sometimes what seems like a WTF is really OK in context; judging without context should not be done.

  • bossie (unregistered)

    Looks very ad-hoc but for all we know it just might be Good Enough.

  • Industrial Automation Engineer (unregistered) in reply to dpm

    Your sarcasm detector is broken.

  • Richard A. O'Keefe (unregistered)
    Comment held for moderation.
  • Gnasher729 (unregistered)

    Here’s what you are supposed to do on iOS: you call a method that returns the mimetype. Doing that ensures that every single application agrees. So it won’t happen that one app thinks something is an image and another thinks it is executable code.

    And then if it has the right MIME type, you throw it into an XML parser, and it will either get parsed or not.

  • Gnasher729 (unregistered) in reply to Randal L. Schwartz

    If it’s ASCII then there should be a line like encoding=ascii, encoded in ASCII. That means nobody must ever create an encoding named ASCJJ that has the code points for I and J swapped :-)

  • (nodebb) in reply to Vilx-

    The Unicode has already been decoded (because it's a string), so it can't (be valid and) start with a BOM.
