- Feature Articles
- CodeSOD
-
Error'd
- Most Recent Articles
- Secret Horror
- Not Impossible
- Monkeys
- Killing Time
- Hypersensitive
- Infallabella
- Doubled Daniel
- It Figures
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
without excusing the WTFs, parsing an excel spreadsheet is a bit of a nightmare because they are stored sparsely and just finding the bottom-right cell is a non-trivial problem. We often had cases where there was a cell BZ20000 with no content but formatted bold, for instance, which generated a huge table when we tried to read the whole sheet into memory.
Admin
Somebody didn't understand how Excel manages rows, columns and values. I wonder why...
Shocked I am not.
Admin
"some messy thing that worked" on the one example they tried.
Admin
"[...] someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working."
Software development in a nutshell.
Admin
Extracting data from Excel spreadsheets is a fraught and thankless task, and this example isn't how to do it.
Just Say No kids.
Admin
"Since it worked, nobody looked at it until it stopped working." pretty well describes probably 90+% of the software out there.
Admin
Perhaps the spreadsheets being imported were generated by some piece of software, so the limits made sense. That is if using Excel as a data transfer mechanism ever made sense.
Admin
Huh. This is odd. I clearly remember that 10 years ago I was doing something with Excel myself, and internally it didn't use A/B/C, but rather neat numeric indexes. But now when I made a little test spreadsheet, it was all A1 this B2 that. Am I remembering wrong? No, it must be Excel who is wrong.
Admin
To be fair, that is probably the most correct way to do it when the alternatives are:
Admin
Extracting data from Excel is trivially easy, if you do it the way you're supposed to. File, Save As... CSV. You now have a CSV file which can be parsed using standard tools.
This really is quite the WTF, because they're reinventing the wheel and coming up with a pacman shape.
Admin
There's still the issue of formulas, but otherwise pretty much.
As to the columns problem, they could've also iterated through the columns via their numeric index. I'm not a Java guy, but in .NET both Excel Interop and OpenXML libraries support numeric column indexes so you can just use a simple loop with a counter.
Admin
edit: Didn't realize their library does not support column indexes. I feel like they should've chosen a better library.
Admin
You picked up the same major point I did. This piece of code is a classic example!
Admin
“All in all, this code reads like someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working.”
Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem..
Admin
To be fair I think Excel itself has similar problems.
Admin
There's also Apache POI, which has been pretty OK in my experience.
Admin
Excel does support numbered columns (in which case the reference format is R1C1, etc.), though with the method used in this submission, that probably wouldn't have been any better.
Admin
Excel supports both forms, A21 and R1C21 addressing. The default display for a sheet is an option. Internally it supports both simultaneously.
RC addressing also supports relative R[-1]C[+1]
Admin
The problem with save to .csv is that makes the user go through an extra step that might get forgotten. Parsing an excel spreadsheet is actually easy--let Microsoft do it for you. They have a .DLL that lets you deal with it in a sane fashion. No appreciable hoop-jumping, the array is sparse so the one wild cell doesn't make a huge in-memory footprint. My current work involves a configuration that's a 2mb spreadsheet--it's an awful lot easier to simply use Excel's capability than write a practical editor for all that data--especially since much of the data is in the form of a grid. (A great big matrix of what goes with what.)
Admin
10000 rows ought to be enough for anyone.
Admin
"Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem.."
I don't know who "they" are, but no one is selling that as the proper way. It's just that most devs fall into that way of doing things because it's the easiest and requires the least thinking.
And if you think that's what Agile is, you're wrong.
Admin
In principle, that is not what Agile is. In practice, though, that's the result of Agile.
Admin
You can toggle between A1-style references and R1C1 references. The latter are more verbose and confusing, but far more powerful and objectively better. 😊
Admin
"The problem with save to .csv is that makes the user go through an extra step"
No, you can automate that.
Admin
That as well - and Excel's handling of CSV is locale-dependent, which is its own whole WTF.
Please explain to my Linux-based PHP application how to use that...
Admin
Yeah, thanks, wish I could, but I'm ETL.
Admin
Easy. Just fix the two huge problems you have there. Linux and PHP.
Admin
Remember when Britain forgot to tell people about covid-positive results, because saving to .XLS instead of .XLSX silently truncates to roughly 32000 rows? I think the same (or similar) issue happened in several countries. Give people a spreadsheet and they will use it as a database.
Admin
I am now officially guilty of this. I simply can't be bothered to test all possible cases, when we have no automated test examples. I have to deliver yesterday, fixing bugs later is acceptable, and the code base is messy enough to require replicating dozens of instances of conditional logic ("different types", but implemented wrongly) in many places.
Did I mention that the lack of automated tests makes refactoring and the subsequently necessary testing a job, that likely takes many many months for every single such misdesign?
Admin
If you use a good excel library there should be functions to find the area of non empy cells and that excludes formated cells and only looks at the actual content.
Admin
I once spent many hours struggling with why a system I was working on was displaying a record with null in a certain field despite the fact that the query clearly said "is not null". It was quite a while before I finally figured out that the field value wasn't null, but the four letters "n" - "u" - "l" - "l".