Admin
Wtf! They finished on time!
Admin
Gern fantasy, didn't read.
Admin
So TRWTF is of course the fact that Tyler didn't write a generic-format importer in the first place, amirite?
Admin
Meh. The graduate student has an idea what it took. So does the Nobel laureate.
It would be the editor of the medical journal, employed by somebody like Elsevier, who thinks it's all magic.
Admin
I'm sure I've read this one before.
Admin
TRWTF is that Tyler didn't run at the first mention of "their system for distributing the schedules was a set of USB thumb-drives with self-hosting web apps"
Admin
How on earth do you write a generic-format importer? At some point you have to parse the input, and for that you need to know the format and how to read it?!
Admin
But why did thee lauerate open the app?
Admin
There are only so many formats in use. Use heuristics to figure out which format each entry is in, normalize the encoding for each, and output a `data Field = PDF Bytes | HTML Text | Plain Text` to work with and output data in a format you control. XML vs CSV is only a matter of a new parser (CSV is mercifully easier to read) and changing the reader method in use from a `XmlFile -> List RawField` to `CsvFile -> List RawField`.
Admin
CSV is mercifully easier to read: I wouldn't say that. At least not if the file was generated with a "home-grown" CSV writer.
Level 1: "Very easy! You just separate the fields by a semicolon!"
Level 2: "Ah yes: If the semicolon appears in a field itself you need to quote the whole text of the field."
Level 3: "Hmmm... I think we forgot that the quote may also appear in the field."
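To make the three levels concrete, here is a minimal sketch using Python's standard `csv` module, which already handles all of them. The helper names `write_row` and `read_row` are made up for illustration, and the semicolon delimiter is just the one this comment mentions.

```python
import csv
import io


def write_row(fields, delimiter=";"):
    """Serialize one record, letting the csv module decide what needs quoting."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buffer.getvalue()


def read_row(line, delimiter=";"):
    """Parse one record back into its list of fields."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter, quotechar='"'))


row = ["plain", "has;semicolon", 'has "quotes"']
line = write_row(row)

# Level 1 thinking: splitting on the delimiter breaks as soon as a field contains it.
naive = line.strip().split(";")

# Levels 2 and 3: a real parser undoes the quoting and the doubled quotes.
parsed = read_row(line)
```

The naive split yields four pieces for a three-field record, while the round trip through `read_row` returns the original fields. A home-grown writer that stops at Level 1 or Level 2 produces files that no reader can reliably undo.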
Admin
Ah. True. CSV is definitely a lawless wasteland. You're right, the move away from XML was a bad move.
Admin
I hope those idiots got billed for a king's ransom for what they did.
Admin
WTF #1 - Distributing the agenda on a USB stick. Security nightmare!
WTF #2 - Running a self-hosted web-app from a USB stick.
WTF #3 - Notepad++ crashing on loading a file. Pics or it didn't happen!
Recommendation for anyone facing this in the future: Don't rely on a technical solution. This is a once-a-year process, where you need information entered manually in one format put into another format. Instead of paying a developer for weeks to write a parser, pay someone for one or two days to manually transcribe it into a sane format that the app can handle.
Admin
TRWTF is that poor Tyler imagines the graduate student using this webapp. I always just follow the crowd and extract the pdfs from the folder structure myself.
Admin
A parser for this is a day or two easily, and I hope you're not asking someone to manually re-encode each submission. Plus, when the conference gives you an updated dataset in a different format, you're going to have to pay that intern for another two days instead of just re-running the program on new data.
Admin
Let's see; generic heuristics rules:
You must also take into account different encoding, languages, date formats, number formats, currency formats, delimiters, binary data and fields containing encoded documents (e.g.: pdf, xml, ...).
Admin
This article needed more proofreading. Just sayin'.
Admin
That does indeed sound like a lot, but from the article, and for the purposes of display, it sounds like you'd only need to distinguish between base64-encoded PDFs and 8/16-bit encoded plain text or HTML. You wouldn't need to care whether a field was currency or a boolean or anything else because you're never going to do anything except display them.
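A rough sketch of that narrower heuristic, assuming each field arrives as raw bytes. The function name `classify_field`, the fallback order of encodings, and the short list of HTML tags are all assumptions for illustration, not anything from the article.

```python
import base64
import binascii
import re

# Crude check for a handful of common HTML tags; real data may need more.
HTML_TAG = re.compile(r"<\s*/?\s*(html|body|div|span|p|br|b|i|ul|ol|li|a)\b[^>]*>", re.I)


def classify_field(raw: bytes):
    """Return ("pdf", bytes), ("html", str) or ("plain", str) for one raw field."""
    # 1. Base64-encoded PDF: decodes cleanly and starts with the PDF magic number.
    try:
        decoded = base64.b64decode(raw.strip(), validate=True)
        if decoded.startswith(b"%PDF-"):
            return ("pdf", decoded)
    except (binascii.Error, ValueError):
        pass  # not valid base64, so treat it as text

    # 2. Text: UTF-16 if there is a byte-order mark, else UTF-8, else Latin-1.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16", errors="replace")
    else:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode("latin-1")  # never fails, may be wrong

    # 3. HTML if it contains something that looks like a known tag.
    kind = "html" if HTML_TAG.search(text) else "plain"
    return (kind, text)
```

Since the fields are only ever displayed, a wrong guess between Latin-1 and some other 8-bit encoding costs a few odd characters rather than corrupted data, which is what keeps a heuristic like this tolerable.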
Admin
Generic importer: Just translate the DNA sequence of a babelfish into binary
Admin
CSV and fixed-column-width text files should be forbidden, given the number of issues you can run into using these formats. Also bad: people that call CSV files Excel files... if there is any piece of software that is able to screw up CSV it is Excel. In my current job I am, alas, still confronted with these horrible formats on a very regular basis. For all its own issues, I much prefer to work with XML, so switching from XML to CSV is plain stupidity... the real WTF (combined with not telling the programmer about the change, of course).
Admin
Uh..... This is a schedule of conference sessions, so there is content, reading material, DATE information and TIME information, as well as room numbers or other location info. You might argue that these are "for display only" -- and you are likely correct -- but these values certainly are not text and should not be treated as text. At the very least, the values might be (should be?) used by the web app to properly organize and sort sessions for each day.
OTOH, this is a great example of how SCIENCE works!
Admin
Oh, you're right, but that's fine, because those fields are not files being submitted by presenters, so they're (hopefully) going to be consistent. If not then you might have a bit of a tedious time, but at least it's not manually-differentiating-between-base64-and-html tedious.
(It's the same kind of argument I have for email addresses: 99% of developers should just treat them as arbitrary text, and the validation step should not be a mostly-wrong regex but to send the user a confirmation email.)
Admin
CSV is the worst… except for all the other tabular formats out there which manage to be worse than that in weird ways. Excel is the main piece of software that causes trouble with CSV. Of course it is. And when there's been several layers of encoding, you're into trying to detect what sort of wooden table was placed under the printout that was photographed and sent to you, and that happens for real and it sucks so so much.
Real World Data. AAAAAAAAAaaaa!
Admin
Done that. Unfortunately, forgot to specify the binary format and it came out as a win16 binary. Had no machine to run it before it became aware of its misery and killed itself.
Admin
thee spellchecker hast thou failed entires
Admin
I was once asked about providing a functioning web site - complete with search engine - that could run from a CD.
Admin
I worked at a company where we processed thousands of files in different CSV dialects, TSV (or other separators) from 3rd party sources. Each source had their own format where fields would be ordered differently. I actually liked it and prefer it to JSON and XML. It is really fast to process and uses very little memory.
We did of course have a custom program that read a configuration file that mapped from the 3rd party format to an internal format, before we sent it to be processed. It would take 5-10 minutes to configure each format, but we didn't have any real issues with it.
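A minimal sketch of what such a configuration-driven mapping could look like. The source names, column names and the `SOURCE_CONFIGS` layout are invented for illustration; the actual program and its configuration format are not described here.

```python
import csv
import io

# Hypothetical per-source configuration: the delimiter plus a mapping from
# the third party's column names to the internal field names.
SOURCE_CONFIGS = {
    "vendor_a": {"delimiter": ",", "columns": {"Name": "name", "Qty": "quantity"}},
    "vendor_b": {"delimiter": "\t", "columns": {"amount": "quantity", "item": "name"}},
}


def normalize(source, text):
    """Read one source's file contents and return rows in the internal format."""
    config = SOURCE_CONFIGS[source]
    reader = csv.DictReader(io.StringIO(text), delimiter=config["delimiter"])
    return [
        {internal: row[external] for external, internal in config["columns"].items()}
        for row in reader
    ]


rows_a = normalize("vendor_a", "Name,Qty\nwidget,3\n")
rows_b = normalize("vendor_b", "amount\titem\n5\tgadget\n")
```

Adding a new source is then one more dictionary entry rather than new parsing code, which fits the 5-10 minutes per format mentioned above.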
Admin
Semicolons? Wouldn't that be SSV?
Admin
Generic importer - yeah right.
My first dev job I worked for a doctor who would always ask "how hard can it be?" before dumping some impossible task on me. It was a wonder to everyone who knew him that I put up with it for four years.
Admin
I don't think I've ever seen an article published with this many typos before...
Admin
The "generic" importer they want you to use is SSIS, where you do one hundred times the clicking for one tenth of the usability. After all, figuring out the row format and looping through them is just "too hard." I don't know which part I liked more, having to redo every single column if even a single database field changed in any way or receiving no useful debugging information when 389107 records in some dummy put "idunno" into a date field.
Did the doctor at least leave you to your own devices to perform those impossible tasks? Once you've worked for somebody who makes their handicaps yours, practically anything looks better.
Admin
CSV should stand (and in my mind, after working with it about 10 times through my career, it does) for "character separated values", because all exporters/importers I've ever seen have a setting for "character to use to separate values", and it's always a text field; the semicolon is just its default value.
Admin
I got caught up trying to figure out what complex search/replace of He and She could lead to 'thee' and 'SHeeila'... Still can't work it out.
Admin
Did I miss something? If Statler and Waldorf had done this the year before, just tell them to use last year's parser, or provide it to be modified.
If they can't do that insist on agreeing formats and schemas for data transfer. Oh, unless some sales person had promised the impossible based on a hilariously fictitious time estimate and then given you a rubber-spined project manager to deal with.
And what industry standard changes from xml to csv nowadays? I'd have called bullshit and asked to see the standard "to ensure I get it right"
Admin
That's all very true if it's regular. This was a random blob of "text" hand-pasted, not in a uniform format and presumably not even in a uniform order, from someone else's emails. Good luck with that.
Admin
The data format had changed. Of course it had changed. Data formats given to you by hostile parties always change. And that's if they know what the format is in the first place; most of the time, as apparently this time, the format is "whatever order I dump my fields in from Outhouse, Excel and Word this time".
Admin
Ah wow, a CSV newbie! CSV is not, as you may think, fields separated by commas. It is instead fields separated by the field separator, which can vary by locale. For instance, in most of Europe, the locale specific field separator is semi-colon. If you give people in those locales comma separated fields, they won't load correctly into (eg) Excel without tweaking the import.
The reason for this is that in those locales, the decimal separator is in fact comma, so using semi-colon as the field separator allows less ambiguity in, eg, the record "10,4;foo"
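The "10,4;foo" record can be read unambiguously once the delimiter and decimal separator are stated explicitly. The sketch below uses Python's standard `csv` module; `parse_record` is a made-up helper, and it deliberately ignores thousands separators.

```python
import csv
import io


def parse_record(line, delimiter=";", decimal=","):
    """Split one record and convert fields that look numeric in the given locale."""
    fields = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    parsed = []
    for field in fields:
        try:
            # Swap the locale's decimal separator for the one float() expects.
            parsed.append(float(field.replace(decimal, ".")))
        except ValueError:
            parsed.append(field)  # not a number, keep the text as-is
    return parsed


# Semicolon as field separator, comma as decimal separator.
european = parse_record("10,4;foo")

# The same bytes read with a comma field separator split the number in two.
misread = parse_record("10,4;foo", delimiter=",", decimal=".")
```

Read with the wrong settings, the same line yields the number 10 and the text "4;foo", which is the ambiguity the semicolon convention is meant to avoid.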
Admin
No, if you describe something as comma-separated and it isn't you are committing a Statler and Waldorf type crime. It's delimiter separated and then tell the poor shit whatever set of characters you've used to delimit fields/records/pages/rabbit-holes. People who stick .csv on something not comma-delimited will be the first to the wall when the revolution comes. OK, second after whoever thought using a comma for decimal separator was a good idea.
Admin
Ah the memories, yes, I had a "project" to provide XML data to a guy with some shonky old piece of shit written on an ancient unix platform (the platform was fine, his coding wasn't). He'd chosen to handcraft his own XML parser, well he perhaps didn't have other options, but the thing simply couldn't cope with perfectly compliant XML (specifically, it couldn't recurse the hierarchy and also needed all tags predefined and in position even if no data was present), meaning he was trying to force me down a road of writing my own pseudo-XML generator so I could produce the non-compliant garbage he needed. This all meant I was getting it in the neck because, apparently, the delay on the project was down to me not delivering.
I did, in a rare moment of political-astuteness manage to stitch him up a treat as he very obviously didn't really know what XML was about and that you can fairly easily demonstrate when something is valid (or not) even if your project manager is not a coder. I'd like to say "and he was swiftly moved to the testing team", but it was a rare victory against a guy who, when not creating technical debt for everyone else, was off scheming and plotting around the office. I got bored with the constant stream of integration projects with a fundamentally shit-and-getting-shitter-every-release legacy system and went off to another contract.
Admin
"Developer Dude (google)
Generic importer - yeah right.
My first dev job I worked for a doctor who would always ask "how hard can it be?" before dumping some impossible task on me. It was a wonder to everyone who knew him that I put up with it for four years."
When people do this to me, I show them some actual code. (Any code, doesn't have to actually be related. Something with lots of punctuation works well.) They normally boggle a bit and rethink.
Admin
Everyone keeps saying this, but it is apples and oranges wrt/ handling data complexity.
Hierarchical relationships are much more sensible and easier to read and parse in JSON and XML. With CSV files, it is of course possible to do the same, but it quickly becomes convoluted.
Hierarchical data naturally fits XML and JSON. Complex relationships between data fit XML and JSON.
They do not fit flat files, excel files, and/or CSV files, not without pain and suffering and gouging of the eyes...
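A small illustration of the point, using an invented conference schedule. The nested form states each session once; flattening it for a CSV file means repeating the parent values on every row (or inventing a sub-delimiter inside a cell).

```python
import json

# Invented sample data: one day, two sessions, a variable number of speakers.
schedule = json.loads("""
{
  "day": "Monday",
  "sessions": [
    {"title": "Keynote", "room": "A", "speakers": ["Ada", "Grace"]},
    {"title": "Posters", "room": "B", "speakers": ["Linus"]}
  ]
}
""")


def flatten(schedule):
    """Turn the nested schedule into flat (day, title, room, speaker) rows."""
    rows = []
    for session in schedule["sessions"]:
        for speaker in session["speakers"]:
            # Day, title and room are duplicated once per speaker.
            rows.append((schedule["day"], session["title"], session["room"], speaker))
    return rows


rows = flatten(schedule)
```

Two sessions become three rows, and whoever reads the flat file has to regroup on title and room to recover the original structure.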
Admin
OMG -- it's a conference. Just can't be that hard to create a website with a CMS back end and let the end user enter whatever data they want into it. Or take one of probably hundreds of open-source conference software and do the same.
And who in their right mind distributes the conference schedule on a thumb drive -- everyone I know that goes to a conference expects to be able to grab an app from the app store for the conference, and use it on their phone.
Aside from the expense of writing software for weeks; and physically copying the files onto thumb drives, the two developers and their manager are criminally negligent for charging their customer for essentially one-off, useless custom software when there are very inexpensive alternatives.
Admin
You think the consultants convinced them to do it that way? "We need you to make some software that shows the conference schedule on our attendees' computers. No, it can't be online because the wifi always goes down. Look, I don't care, this is our conference and we know how it should be done. Do you want the job or not?"
I'm at the point where I can tell them to take a hike, but I haven't always been in that position.
Admin
Generic data import, ah yes... format conversion, you call it? How do you know the relation between unrelated data in a generic importer, and be able to make sense of it... I think you are confusing data import with data format conversion...
Admin
I'm personally more and more a fan of hdf5, especially for BIG datasets (e.g. thousands of millions of 2 byte samples).
Admin
"pay someone for one or two days to manually transcribe it into a sane format that the app can handle." As my supervisor at university put it: Never underestimate the typing speed of a good secretary.
Admin
First to the wall will be people who allowed VARIOUS field delimiters. And date formats. And different units like miles or kilometer to measure distances. I say, we should simply use "foo" for everything, the context will make clear what we are talking about.
"How far away is it?" - "Oh, about 10 1/2 foo." "Excuse me, can you tell me how late it is?" - "It's quarter past 10 foo." "Well, how many apples should I buy?" - "5 foo." "Wow, this CPU is really fast, what's the clock rate?" - "2.4 gigafoo."
It's really that simple!
Admin
Oh yeah? Go back and read the previous Gern article if you want to see typos. Guy just doesn't care.
It has now been nearly 5 months without any more Gern postings. Is our long national nightmare finally over?