Admin
Everyone knows you should check the file extension first.
Admin
Whatever happened to the good old days of reading the entire file contents, then one-by-one checking if you can parse as JSON, CSV, XML, and Base64?
Admin
. . . and if there's no file involved? The code presented shows no such dependency, merely a test of a stream of characters.
Admin
This makes me want to do a pull request on the source for the Linux "file" command. Currently it requires the <?xml tag to recognize xml.
Admin
Text files can still have so many encodings that any detection like that is pretty much as wild a guess as file extensions are.
Admin
This is not necessarily a WTF. We don't know where the possible streams of data are coming from. Maybe it's already clear that it's some kind of structured and condensed data, either JSON, XML, or Linux property files; then it's totally okay to rely on the first character.
Admin
Again, encoding :-)
Admin
"Any string of text which starts with < is clearly an XML file."
I don't get it. ALL of my php files start with a "<", i.e. "<?php"
Admin
This could be acceptable if the input's already restricted. Say you're receiving inputs from another system that might emit data in any of a handful of different formats. If among those formats XML is the only one that starts with 0x3C, this seems like a reasonable and cheap test to decide whether to ask the parser to have a go at it.
A comment to that effect would be nice though, in case one day the upstream system is extended to emit a conflicting format. Even if at that point you need a more sophisticated test to differentiate it from XML, it doesn't mean the straightforward approach was wrong initially.
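A minimal sketch of that kind of cheap pre-check, with the comment attached. The function name and the upstream system are hypothetical; the only assumption carried over from above is that XML is the sole format that can start with "<".

```python
def looks_like_xml(text: str) -> bool:
    """Cheap pre-check before handing the input to a real XML parser.

    ASSUMPTION: the (hypothetical) upstream system only emits formats of
    which XML is the sole one whose first non-whitespace character is '<'.
    If upstream ever adds a conflicting format, this test must be revisited.
    """
    return text.lstrip().startswith("<")
```

Note that this happily says yes to a "<?php" file too, which is exactly why it only works when the input is already restricted.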
Admin
"they at least trimmed whitespace off"
Non-python coder detected. ;)
Admin
Not all XML files start with that. They probably should, but they don't, especially when they've got to be fed through someone's crufty home-rolled XML parser (that totally isn't just a bunch of regexps).
Admin
Base64 is an encoding not a format so wouldn't you have to decode the base 64 then try to parse the format :P
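Roughly like this, as a sketch: decode first, then try the format. JSON is picked as the inner format purely for illustration, and the function name is made up.

```python
import base64
import binascii
import json


def parse_base64_json(payload: str):
    """Decode Base64, then try to parse the decoded bytes as JSON.

    Returns the parsed value, or None if either step fails.
    """
    try:
        # validate=True rejects characters outside the Base64 alphabet
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None
    try:
        return json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None
```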
Admin
It's a minor wtf to call a variable str, as it's the Python string type. Or is Python TRWTF for using simple names like str and id, and letting people redefine them?
Admin
Well, the file could always start with a BOM...
Admin
I'm pretty sure that an XML parser would do just as good a job in just as fast a time, since it actually knows how an XML file is structured. Worst case, I suppose, is that it would read the entire file in before parsing it, but even that is unlikely to be a performance killer. Best case is, you actually get to use a parser that understands the file format ... which would include BOMs and so on.
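For what it's worth, a sketch of "just ask the parser" using the standard library's xml.etree.ElementTree. The function name is made up, and this checks well-formedness only, not validity against any schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if the parser accepts `data` as a well-formed document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True
```

Passing bytes rather than an already-decoded string lets the parser deal with the encoding declaration and any BOM itself. For untrusted input, the Python documentation's warnings about XML vulnerabilities apply.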
Admin
Reading an entire file just to find out its type is a really terrible idea. In the vast majority of cases, you can get a good enough answer by examining a fixed initial part, which is O(1) rather than the O(N) of reading it all.
And to answer the original question, all XML documents must start with '<' (apart from whitespace). The specification says that a document starts with a 'prolog', which consists of an optional XMLDecl ("<?xml", so it starts with '<'), followed by optional 'Misc' items, each of which is a "Comment" ("<!--"...), a "PI" ("<?"...), or "S" (whitespace); then an optional "doctypedecl" ("<!DOCTYPE"...); and then the root element, which also starts with '<'. So the first non-whitespace character must always be "<".
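A sketch of the fixed-prefix check, assuming a byte stream in an ASCII-compatible encoding with no BOM. The function name and the 1024-byte limit are arbitrary choices for illustration.

```python
import io


def first_non_whitespace_byte(stream, limit: int = 1024) -> bytes:
    """Return the first non-whitespace byte found in the first `limit`
    bytes of `stream`, or b"" if there is none.

    Reads at most `limit` bytes, so the cost does not grow with file size.
    """
    head = stream.read(limit)
    # XML's "S" production: space, tab, carriage return, line feed
    stripped = head.lstrip(b" \t\r\n")
    return stripped[:1]
```

Usage would be something like first_non_whitespace_byte(io.BytesIO(data)) == b"<". Some stream types may return fewer bytes than requested from a single read, so treat this as a best-effort peek.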
Admin
And then good luck guessing whether it's ASCII, Latin-1, UTF-8, UTF-16 etc before you even start thinking about what "<" is. :) (Hint: UTF files can have BOM prefixes.)
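The BOM part, at least, can be checked mechanically. A sketch using the constants from Python's codecs module; the function name is made up, and when there is no BOM it simply gives up rather than guessing between ASCII, Latin-1 and friends.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name that consumes the BOM found at the start of
    `data`, or None if `data` does not start with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

The returned names ("utf-16", "utf-32", "utf-8-sig") are the codecs that read the BOM and drop it while decoding.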
Admin
Count me amongst those who don't see a problem here without knowing the context--more than once I've done parsing where the input is one of a limited variety of possible things, and any attribute that's unique to only one option is perfectly adequate in such a situation. Yeah, it puked on the .xps that was really a scanned image of the form rather than the computer-generated form, but so what? It was a fatal error no matter what; a more sophisticated parser would simply have been a bit more graceful about how it squawked.
Admin
I was thinking the same way. As we do not know where the input comes from, this could be the cheapest way to verify that the input really is XML and not some other (proprietary) format.
Sometimes what seems like a WTF is really OK in that context; judging without context should not be done.
Admin
Looks very ad-hoc but for all we know it just might be Good Enough.
Admin
Your sarcasm detector is broken.
Admin
Here’s what you are supposed to do on iOS: you call a method that returns the mimetype. Doing that ensures that every single application agrees. So it won’t happen that one app thinks something is an image and another thinks it is executable code.
And then if it has the right mime type, you throw it into an XML parser, and it will either get parsed or not.
Admin
If it’s ASCII then there should be a line like encoding=ascii, encoded in ascii. That means nobody must ever create an encoding named ASCJJ that has the code points for I and j swapped :-)
Admin
The unicode has already been parsed (because it's a string), so it can't (be valid and) start with a BOM.
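Whether that holds depends on how the bytes were decoded. A small sketch in Python 3 (the sample bytes are made up): the plain "utf-8" codec keeps a leading BOM as the character U+FEFF, while "utf-8-sig" drops it, and strip() does not treat U+FEFF as whitespace.

```python
raw = b"\xef\xbb\xbf<a/>"  # UTF-8 BOM followed by a tiny document

plain = raw.decode("utf-8")    # BOM survives as a leading U+FEFF
sig = raw.decode("utf-8-sig")  # BOM is removed during decoding

# The "first non-whitespace character is '<'" test on each result
starts_with_angle_plain = plain.strip().startswith("<")
starts_with_angle_sig = sig.strip().startswith("<")
```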