The Daily WTF: Curious Perversions in Information Technology

2021-07-14 Reply Admin

without excusing the WTFs, parsing an excel spreadsheet is a bit of a nightmare because they are stored sparsely and just finding the bottom-right cell is a non-trivial problem. We often had cases where there was a cell BZ20000 with no content but formatted bold, for instance, which generated a huge table when we tried to read the whole sheet into memory.

2021-07-14 Reply Admin

Somebody didn't understand how Excel manages rows, columns and values. I wonder why...

Shocked I am not.

2021-07-14 Reply Admin

"some messy thing that worked" on the one example they tried.

nerd4sale · 2021-07-14 Reply Admin

"[...] someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working."

Software development in a nutshell.

2021-07-14 Reply Admin

Extracting data from Excel spreadsheets is a fraught and thankless task, and this example isn't how to do it.

Just Say No kids.

2021-07-14 Reply Admin

"Since it worked, nobody looked at it until it stopped working." pretty well describes probably 90+% of the software out there.

2021-07-14 Reply Admin

Perhaps the spreadsheets being imported were generated by some piece of software, so the limits made sense. That is if using Excel as a data transfer mechanism ever made sense.

2021-07-14 Reply Admin

Huh. This is odd. I clearly remember that 10 years ago I was doing something with Excel myself, and internally it didn't use A/B/C, but rather neat numeric indexes. But now when I made a little test spreadsheet, it was all A1 this B2 that. Am I remembering wrong? No, it must be Excel who is wrong.

2021-07-14 Reply Admin

parsing an excel spreadsheet is a bit of a nightmare because they are stored sparsely and just finding the bottom-right cell is a non-trivial problem

To be fair, that is probably the most correct way to do it when the alternatives are:

Store every single empty cell within the "bounding box". (Hello n^2 requirements.)
Either store everything or store sparsely, based on some heuristic. (Now you have two problems.)

2021-07-14 Reply Admin

Extracting data from Excel is trivially easy, if you do it the way you're supposed to. File, Save As... CSV. You now have a CSV file which can be parsed using standard tools.

This really is quite the WTF, because they're reinventing the wheel and coming up with a pacman shape.

2021-07-14 Reply Admin

There's still the issue of formulas, but otherwise pretty much.

As to the columns problem, they could've also iterated through the columns via their numeric index. I'm not a Java guy, but in .NET both Excel Interop and OpenXML libraries support numeric column indexes so you can just use a simple loop with a counter.

2021-07-14 Reply Admin

edit: Didn't realize their library does not support column indexes. I feel like they should've chosen a better library.

2021-07-14 Reply Admin

You picked up the same major point I did. This piece of code is a classic example!

2021-07-14 Reply Admin

“All in all, this code reads like someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working.”

Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem..

konnichimade · 2021-07-14 Reply Admin

To be fair I think Excel itself has similar problems.

2021-07-14 Reply Admin

There's also Apache POI, which has been pretty OK in my experience.

2021-07-14 Reply Admin

Excel does support numbered columns (in which case the reference format is R1C1, etc.), though with the method used in this submission, that probably wouldn't have been any better.

2021-07-14 Reply Admin

Excel supports both forms, A21 and R1C21 addressing. The default display for a sheet is an option. Internally it supports both simultaneously.

RC addressing also supports relative R[-1]C[+1]

2021-07-14 Reply Admin

The problem with save to .csv is that makes the user go through an extra step that might get forgotten. Parsing an excel spreadsheet is actually easy--let Microsoft do it for you. They have a .DLL that lets you deal with it in a sane fashion. No appreciable hoop-jumping, the array is sparse so the one wild cell doesn't make a huge in-memory footprint. My current work involves a configuration that's a 2mb spreadsheet--it's an awful lot easier to simply use Excel's capability than write a practical editor for all that data--especially since much of the data is in the form of a grid. (A great big matrix of what goes with what.)

2021-07-14 Reply Admin

10000 rows ought to be enough for anyone.

2021-07-14 Reply Admin

"Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem.."

I don't know who "they" are, but no one is selling that as the proper way. It's just that most devs fall into that way of doing things because it's the easiest and requires the least thinking.

And if you think that's what Agile is, you're wrong.

2021-07-14 Reply Admin

In principle, that is not what Agile is. In practice, though, that's the result of Agile.

2021-07-15 Reply Admin

You can toggle between A1-style references and R1C1 references. The latter are more verbose and confusing, but far more powerful and objectively better. 😊

2021-07-15 Reply Admin

"The problem with save to .csv is that makes the user go through an extra step"

No, you can automate that.

2021-07-15 Reply Admin

The problem with save to .csv is that makes the user go through an extra step that might get forgotten.

That as well - and Excel's handling of CSV is locale-dependent, which is its own whole WTF.

Parsing an excel spreadsheet is actually easy--let Microsoft do it for you. They have a .DLL that lets you deal with it in a sane fashion.

Please explain to my Linux-based PHP application how to use that...

2021-07-15 Reply Admin

Yeah, thanks, wish I could, but I'm ETL.

2021-07-16 Reply Admin

Easy. Just fix the two huge problems you have there. Linux and PHP.

R3D3 · 2021-07-16 Reply Admin

10000 rows ought to be enough for anyone.

Remember when Britain forgot to tell people about covid-positive results, because saving to .XLS instead of .XLSX silently truncates to roughly 32000 rows? I think the same (or similar) issue happened in several countries. Give people a spreadsheet and they will use it as a database.

R3D3 · 2021-07-16 Reply Admin

"some messy thing that worked" on the one example they tried.

I am now officially guilty of this. I simply can't be bothered to test all possible cases, when we have no automated test examples. I have to deliver yesterday, fixing bugs later is acceptable, and the code base is messy enough to require replicating dozens of instances of conditional logic ("different types", but implemented wrongly) in many places.

Did I mention that the lack of automated tests makes refactoring and the subsequently necessary testing a job, that likely takes many many months for every single such misdesign?

2021-07-16 Reply Admin

If you use a good excel library there should be functions to find the area of non empy cells and that excludes formated cells and only looks at the actual content.

2021-07-22 Reply Admin

I once spent many hours struggling with why a system I was working on was displaying a record with null in a certain field despite the fact that the query clearly said "is not null". It was quite a while before I finally figured out that the field value wasn't null, but the four letters "n" - "u" - "l" - "l".

A Parser Par Excellence

Leave a comment on “A Parser Par Excellence”