• I am not robot (unregistered)

    without excusing the WTFs, parsing an excel spreadsheet is a bit of a nightmare because they are stored sparsely and just finding the bottom-right cell is a non-trivial problem. We often had cases where there was a cell BZ20000 with no content but formatted bold, for instance, which generated a huge table when we tried to read the whole sheet into memory.

  • Thomas (unregistered)

    Somebody didn't understand how Excel manages rows, columns and values. I wonder why...

    Shocked I am not.

  • D J Hemming (unregistered)

    "some messy thing that worked" on the one example they tried.

  • (nodebb)

    "[...] someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working."

    Software development in a nutshell.

  • Steve (unregistered)

    Extracting data from Excel spreadsheets is a fraught and thankless task, and this example isn't how to do it.

    Just Say No kids.

  • Scott (unregistered)

    "Since it worked, nobody looked at it until it stopped working." pretty well describes probably 90+% of the software out there.

  • my name is missing (unregistered)

    Perhaps the spreadsheets being imported were generated by some piece of software, so the limits made sense. That is if using Excel as a data transfer mechanism ever made sense.

  • Vilx- (unregistered)

    Huh. This is odd. I clearly remember that 10 years ago I was doing something with Excel myself, and internally it didn't use A/B/C, but rather neat numeric indexes. But now when I made a little test spreadsheet, it was all A1 this B2 that. Am I remembering wrong? No, it must be Excel who is wrong.

  • Chronomium (unregistered) in reply to I am not robot

    parsing an excel spreadsheet is a bit of a nightmare because they are stored sparsely and just finding the bottom-right cell is a non-trivial problem

    To be fair, that is probably the most correct way to do it when the alternatives are:

    1. Store every single empty cell within the "bounding box". (Hello n^2 requirements.)
    2. Either store everything or store sparsely, based on some heuristic. (Now you have two problems.)
  • Dave (unregistered) in reply to Steve

    Extracting data from Excel is trivially easy, if you do it the way you're supposed to. File, Save As... CSV. You now have a CSV file which can be parsed using standard tools.

    This really is quite the WTF, because they're reinventing the wheel and coming up with a pacman shape.

  • Jay (unregistered) in reply to Dave

    There's still the issue of formulas, but otherwise pretty much.

    As to the columns problem, they could've also iterated through the columns via their numeric index. I'm not a Java guy, but in .NET both Excel Interop and OpenXML libraries support numeric column indexes so you can just use a simple loop with a counter.

  • Jay (unregistered)

    edit: Didn't realize their library does not support column indexes. I feel like they should've chosen a better library.

  • Indeed (unregistered) in reply to nerd4sale

    You picked up the same major point I did. This piece of code is a classic example!

  • Mark (unregistered)

    “All in all, this code reads like someone didn't fully understand the problem they were trying to solve, hacked at it until they got some messy thing that worked, committed it and called it a day. Since it worked, nobody looked at it until it stopped working.”

    Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem..

  • (nodebb) in reply to I am not robot

    To be fair I think Excel itself has similar problems.

  • Angela Anuszewski (google) in reply to Jay

    There's also Apache POI, which has been pretty OK in my experience.

  • Meir (unregistered) in reply to Vilx-

    Excel does support numbered columns (in which case the reference format is R1C1, etc.), though with the method used in this submission, that probably wouldn't have been any better.

  • Anon E. Mouse (unregistered) in reply to Vilx-

    Excel supports both forms, A21 and R1C21 addressing. The default display for a sheet is an option. Internally it supports both simultaneously.

    RC addressing also supports relative R[-1]C[+1]

  • Loren Pechtel (unregistered) in reply to Dave

    The problem with save to .csv is that makes the user go through an extra step that might get forgotten. Parsing an excel spreadsheet is actually easy--let Microsoft do it for you. They have a .DLL that lets you deal with it in a sane fashion. No appreciable hoop-jumping, the array is sparse so the one wild cell doesn't make a huge in-memory footprint. My current work involves a configuration that's a 2mb spreadsheet--it's an awful lot easier to simply use Excel's capability than write a practical editor for all that data--especially since much of the data is in the form of a grid. (A great big matrix of what goes with what.)

  • Bill Gates (unregistered)

    10000 rows ought to be enough for anyone.

  • xtal256 (unregistered) in reply to Mark

    "Isn’t that what they sell as the proper way? Do the simplest thing that could possibly work, and once it doesn’t work anymore, it’s someone else’s problem.."

    I don't know who "they" are, but no one is selling that as the proper way. It's just that most devs fall into that way of doing things because it's the easiest and requires the least thinking.

    And if you think that's what Agile is, you're wrong.

  • mushroom farm (unregistered) in reply to xtal256

    In principle, that is not what Agile is. In practice, though, that's the result of Agile.

  • goldcrest25 (unregistered) in reply to Vilx-

    You can toggle between A1-style references and R1C1 references. The latter are more verbose and confusing, but far more powerful and objectively better. 😊

  • Dave (unregistered) in reply to Loren Pechtel

    "The problem with save to .csv is that makes the user go through an extra step"

    No, you can automate that.

  • RLB (unregistered) in reply to Loren Pechtel

    The problem with save to .csv is that makes the user go through an extra step that might get forgotten.

    That as well - and Excel's handling of CSV is locale-dependent, which is its own whole WTF.

    Parsing an excel spreadsheet is actually easy--let Microsoft do it for you. They have a .DLL that lets you deal with it in a sane fashion.

    Please explain to my Linux-based PHP application how to use that...

  • Barf4Eva (unregistered) in reply to Steve

    Yeah, thanks, wish I could, but I'm ETL.

  • 516052 (unregistered) in reply to RLB

    Easy. Just fix the two huge problems you have there. Linux and PHP.

  • (nodebb) in reply to Bill Gates

    10000 rows ought to be enough for anyone.

    Remember when Britain forgot to tell people about covid-positive results, because saving to .XLS instead of .XLSX silently truncates to roughly 32000 rows? I think the same (or similar) issue happened in several countries. Give people a spreadsheet and they will use it as a database.

  • (nodebb) in reply to D J Hemming

    "some messy thing that worked" on the one example they tried.

    I am now officially guilty of this. I simply can't be bothered to test all possible cases, when we have no automated test examples. I have to deliver yesterday, fixing bugs later is acceptable, and the code base is messy enough to require replicating dozens of instances of conditional logic ("different types", but implemented wrongly) in many places.

    Did I mention that the lack of automated tests makes refactoring and the subsequently necessary testing a job, that likely takes many many months for every single such misdesign?

  • David Mårtensson (unregistered) in reply to I am not robot

    If you use a good excel library there should be functions to find the area of non empy cells and that excludes formated cells and only looks at the actual content.

  • jay (unregistered)

    I once spent many hours struggling with why a system I was working on was displaying a record with null in a certain field despite the fact that the query clearly said "is not null". It was quite a while before I finally figured out that the field value wasn't null, but the four letters "n" - "u" - "l" - "l".

Leave a comment on “A Parser Par Excellence”

Log In or post as a guest

Replying to comment #:

« Return to Article