• Industrial Automation Engineer (unregistered)

    Since the requirements were absent, my first thought was that someone wanted to convert a PDF containing an image of a document, photographed on a wooden table, and complained that it didn't work. The actual WTF is even worse (than failure.)

  • WTFGuy (unregistered)

    Back in the early days of .NET, there were a LOT more command line utilities one could buy or download than there were reliable .NET-compatible libraries one could buy or download. Shelling out as interop was pretty much de rigueur in those days.

    LOTS of code got written for .NET v1.0 & v1.1 that is still running today. Should it be modernized? Sure. Has it been? Usually not. Other than, as this sample indicates, the minimum patchwork needed to, e.g., update a build path as the compiler version has slowly incremented from VS2003 to VS2022.

  • (nodebb)

    TRWTF is deploying on a Friday.

  • Prime Mover (unregistered)

    But did GRH finish the job of fixing this before Monday? We need to be told.

  • RLB (unregistered)

    Three quotation marks in a string are not one quotation mark in the output. Three quotation marks are two quotation marks being one in the output, plus one terminating the string.
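
    To make the counting concrete, here is a minimal sketch of how a doubled-quote string literal decodes. It is written in Python purely for illustration; the function name is hypothetical and this is not how any real BASIC or VB compiler is implemented. The outer pair of quotes delimits the string, and each `""` inside stands for one `"`.

```python
def decode_doubled_quote_literal(literal: str) -> str:
    """Decode a literal where "" inside the delimiters means one " character.

    Hypothetical helper for illustration only.
    """
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("literal must start and end with a double quote")

    body = literal[1:-1]  # strip the opening and terminating quote
    out = []
    i = 0
    while i < len(body):
        if body[i] == '"':
            # Inside the body, a quote is only legal as half of a doubled pair.
            if i + 1 >= len(body) or body[i + 1] != '"':
                raise ValueError("unescaped quote inside literal")
            out.append('"')
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# The source text """a.pdf""" is: opening quote, doubled quote, a.pdf,
# doubled quote, terminating quote.
print(decode_doubled_quote_literal('"""a.pdf"""'))  # prints "a.pdf" (with the quotes)
```

    So the three marks at the end are a doubled pair (one quote in the output) followed by the terminator, which matches the reading above.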

  • MaxiTB (unregistered)

    PDF to text drops microphone

  • MaxiTB (unregistered) in reply to Dragnslcr

    Not uncommon, not unreasonable actually. In fact I'd go so far as to say that if you can't deploy into prod when no dev is around, then there is something very, very wrong with your ALM in general ;-)

  • (nodebb) in reply to Dragnslcr

    We deploy 7 days a week, all "shifts" round the clock... It is just a matter of investing in strict engineering principles for the entire software/hardware/ecosystem/environment. Alas, few seem to be willing or able... NOTE: I am NOT suggesting that this investment is worthwhile in all (or even most) cases... but IMPO a competent professional should be able to understand and, if needed, implement this level of requirements.

  • Sauron (unregistered)

    """Converting a PDF file into plain text""" ?

    So many things can go wrong with just that very idea, so that's also a WTF in and of itself!

    Some PDF files don't even contain the text as text, but some symbolic representation and a symbolic-representation-to-character-image mapping table.

    So, converting the content of an arbitrary PDF to text may require an OCR program (or a human + a keyboard).

    PDF-to-TXT is hardly better than Web 0.1: https://thedailywtf.com/articles/Web_0_0x2e_1

    At best it's Web 0.2

  • (nodebb) in reply to Sauron

    It depends on the context. If your program processes PDFs from one source to get one region of text from an expected page number, then it's a reasonable choice. Just because this looks like code for a generic pdf-to-text function doesn't mean they really expected to handle every single PDF on the planet.

    For example, at one of my past jobs, we had to print a quarterly financial report from one of the big investment firms and mail it to hundreds of individual report recipients. They delivered us a single PDF with a few thousand pages -- where every 3 to 10 pages needed to be mailed to a different recipient. No PDF index was provided. When I got there, humans had to spend many, many man-hours pulling individual pages off a stack to stuff them into envelopes.

    But the PDF came from a single source, always using the same PDF structure and formatting.

    pdftoxml could extract all the text from the PDF, with handy XML tags showing the page breaks. Examining the pages to find the one with no page number gave me the first page of each set. Output the page number of those first-of-set pages and I have an index. (The recipient name could also be extracted from a predictable place in the XML, so the index was usable as a lookup or checklist.) Use those page numbers with a tool to extract pages from a pdf to another pdf, and I have broken up the huge document into individual one-destination sets. And set the high speed printer to offset each pdf as it comes out of the printer, and I've just made it so the humans stuffing the envelopes can almost keep up with the high speed printer! It cut the turnaround time from about five days employing a dozen people to about two days employing five people.
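
    The indexing step described above might look roughly like this. It is only a sketch under loud assumptions: the XML layout (`<document>`, `<page>`, `<text>`) and the "Page N of M" footer are invented for illustration and are not the real pdftoxml output schema, and both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed footer format on continuation pages; the real report may differ.
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+$")


def first_pages(xml_text):
    """Return 1-based numbers of pages that carry no page-number footer."""
    root = ET.fromstring(xml_text)
    starts = []
    for number, page in enumerate(root.iter("page"), start=1):
        lines = [(t.text or "").strip() for t in page.iter("text")]
        if not any(PAGE_FOOTER.match(line) for line in lines):
            starts.append(number)  # no footer: first page of a new set
    return starts


def page_ranges(starts, total_pages):
    """Turn first-page numbers into inclusive (start, end) ranges per set."""
    ranges = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges


SAMPLE = """
<document>
  <page><text>Dear A. Smith</text></page>
  <page><text>Holdings</text><text>Page 2 of 3</text></page>
  <page><text>Totals</text><text>Page 3 of 3</text></page>
  <page><text>Dear B. Jones</text></page>
  <page><text>Holdings</text><text>Page 2 of 2</text></page>
</document>
"""

starts = first_pages(SAMPLE)
print(starts)                  # [1, 4]
print(page_ranges(starts, 5))  # [(1, 3), (4, 5)]
```

    Each range can then be handed to whatever page-extraction tool is in use to cut one PDF per recipient.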

    Lesson: to prevent garbage out, be very careful that what you feed in is not garbage.

  • D (unregistered)

    You mean two double quote chars, right?

  • Gearhead (unregistered)

    Submitter here. My frist front page post!

    You will not be surprised to learn there are multiple functions within this program that perform a pdf to txt conversion in slightly different ways. Sometimes PdfSharpCore is used to strip annotations, except when it isn't. Some document types are given unique (uuid) filenames, but other document types use the exact same filename every time. When the program crashes all the old text files are left in the temp directory. But hey, let's just check if a txt file exists. What could go wrong.
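
    One way to sidestep the stale-file problem, sketched in Python rather than the program's own .NET code: give every conversion its own fresh temporary directory, so a text file that exists can only have come from this run. The `convert` callable stands in for whatever external converter gets shelled out to; it and the function name are hypothetical.

```python
import tempfile
from pathlib import Path


def convert_in_private_dir(pdf_path, convert):
    """Run convert(pdf_path, txt_path) in a fresh temp dir and return the text.

    The directory is removed when the block exits, whether normally or via an
    exception, so leftovers from earlier runs cannot be mistaken for output.
    """
    with tempfile.TemporaryDirectory() as workdir:
        txt_path = Path(workdir) / "output.txt"
        convert(Path(pdf_path), txt_path)
        # The directory was empty a moment ago, so existence is now meaningful.
        if not txt_path.exists():
            raise RuntimeError(f"converter produced no output for {pdf_path}")
        return txt_path.read_text(encoding="utf-8")
```

    This does not help if the process is killed outright, but it does make "check if a txt file exists" mean what it appears to mean.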

    But did GRH finish the job of fixing this before Monday? We need to be told.

    Thanks for asking! The fat client was up, but the Citrix version was down for a week. And I still don't know why the deploy behavior changed. My suspicion is related to a Visual Studio version change. As to why we use ClickOnce with Citrix, well, yeah.

  • Gearhead (unregistered)

    """Converting a PDF file into plain text""" ? So many things can go wrong with just that very idea, so that's also a WTF in and of itself!

    Indeed. Matlab is one of the PDF types this program consumes. One of our users reported the following just this week:

    It turns out the default Matlab renderer has a limit of pixels/vector points for a vectorized PDF. Above the limit, it defaults to a bitmap image. Changing the default renderer increased the vector point limit and it now exports fine.

  • (nodebb) in reply to TheCPUWizard

    Nope, I don't care how strict your engineering principles are. There's always a risk that something unexpected will happen in the production environment, and nobody will be happy if they have to work over a weekend that they weren't planning on working.

  • MaxiTB (unregistered) in reply to Dragnslcr

    If you can't roll back a deployment with one simple command line or one click in whatever deployment toolchain web UI you are using, then there is something very wrong (or unprofessional) about your deployment pipeline ;-) As in, nobody has to "come in for work" besides the already on-site ops team.

  • MaxiTB (unregistered) in reply to WTFGuy

    huh? Roslyn. huh?

  • WTFGuy (unregistered)

    Please explain. I know what Roslyn was/is. Beyond that I can't tell if you're joking or serious, nor what your point might be. I'm not trying to be combative; I'm just genuinely baffled.

  • (nodebb) in reply to D

    I have similar complaints about the escape character in C# - oh yes, let's use a directory delimiter. On the other hand, VB.Net has had ControlChars.Quote since... the beginning?

  • (nodebb)

    Double double-quote characters are a very very very old convention in BASIC strings but were never universal (many BASIC variations did not support escaping of any kind). I'm fairly sure it predates Windows. It may even predate MS-DOS. It just might even predate C.

  • Concept14 (unregistered)

    And what happens when a file name contains ".pdf" somewhere besides the end?
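
    A quick illustration of the failure mode, assuming the text file name is derived by replacing every occurrence of ".pdf" in the name (an assumption for the sake of the example; both helper names are hypothetical, and this is Python rather than the program's .NET code):

```python
from pathlib import Path


def naive_txt_name(pdf_name):
    """Replace every '.pdf' in the name, wherever it occurs."""
    return pdf_name.replace(".pdf", ".txt")


def safe_txt_name(pdf_name):
    """Swap only the final extension, whatever its letter case."""
    return str(Path(pdf_name).with_suffix(".txt"))


print(naive_txt_name("report.pdf.final.pdf"))  # report.txt.final.txt
print(safe_txt_name("report.pdf.final.pdf"))   # report.pdf.final.txt

# The naive version also silently does nothing for upper-case extensions.
print(naive_txt_name("SCAN.PDF"))  # SCAN.PDF
print(safe_txt_name("SCAN.PDF"))   # SCAN.txt
```

    Treating the extension as a path component rather than a substring avoids both surprises.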

  • (nodebb) in reply to staticsan

    I actually can't recall any BASIC version that doesn't use double quotes. Commodore 64 BASIC for sure had double quotes, so did Microsoft's Amiga BASIC. Obviously Q-BASIC too and I can remember a lot of more exotic interpreters and compilers using double quotes as an escape character in strings for, well, double quotes. C-style escaping was more common with C-family languages like C, C++ and C#. Can't remember how it was done with the PASCAL family, COBOL and other niche languages specifically, but doubling string terminator characters was in general very common in the 80s and before.

  • (nodebb)

    "Double the quote character inside a string to represent an embedded quote character" is really quite a common method, used by SQL and by CSV files among others; I personally prefer this method over using an escape character, but it's a very trivial thing to get upset about. Only get upset with environments that don't give you either option (I'm looking at you, Informatica).

Leave a comment on “How To Ruin a Long Weekend”
