Admin
Since the requirements were absent, my first thought was that someone wanted to convert a PDF containing an image of a document, photographed on a wooden table, and complained that it didn't work. The actual WTF is even worse (than failure.)
Admin
Back in the early days of .NET, there were a LOT more command line utilities one could buy or download than there were reliable .NET-compatible libraries one could buy or download. Shelling out as interop was pretty much de rigueur in those days.
LOTS of code got written for .NET v1.0 & v1.1 that is still running today. Should it be modernized? Sure. Has it been? Usually not. Other than, as this sample indicates, the minimum patchwork needed to, e.g., update a build path as the compiler version has slowly incremented from VS2003 to VS2022.
Admin
TRWTF is deploying on a Friday.
Admin
But did GRH finish the job of fixing this before Monday? We need to be told.
Admin
Three quotation marks in a string are not one quotation mark in the output. Three quotation marks are two quotation marks that become one in the output, plus one that terminates the string.
Admin
PDF to text drops microphone
Admin
Not uncommon, not unreasonable actually. In fact I'd go so far as to say that if you can't deploy into prod when no dev is around then there is something very, very wrong with your ALM in general ;-)
Admin
We deploy 7 days a week, all "shifts" round the clock... It is just a matter of investing in strict engineering principles for the entire software/hardware/ecosystem/environment. Alas, few seem to be willing or able... NOTE: I am NOT suggesting that this investment is worthwhile in all (or even most) cases... but IMPO a competent professional should be able to understand and if needed implement this level of requirements.
Admin
"""Converting a PDF file into plain text""" ?
So many things can go wrong with just that very idea, so that's also a WTF in and of itself!
Some PDF files don't even contain the text as text, but some symbolic representation and a symbolic-representation-to-character-image mapping table.
So, converting the content of an arbitrary PDF to text may require an OCR program (or a human + a keyboard).
PDF-to-TXT is hardly better than Web 0.1: https://thedailywtf.com/articles/Web_0_0x2e_1
At best it's Web 0.2
Admin
It depends on the context. If your program processes PDFs from one source to get one region of text from an expected page number, then it's a reasonable choice. Just because this looks like code for a generic pdf-to-text function doesn't mean they really expected to handle every single PDF on the planet.
For example, at one of my past jobs, we had to print a quarterly financial report from one of the big investment firms and mail it to hundreds of individual report recipients. They delivered us a single PDF with a few thousand pages -- where every 3 to 10 pages needed to be mailed to a different recipient. No PDF index was provided. When I got there humans had to spend many many man-hours pulling individual pages off a stack to stuff them into envelopes.
But the PDF came from a single source, always using the same PDF structure and formatting.
pdftoxml could extract all the text from the PDF, with handy XML tags showing the page breaks. Examining the pages to find the one with no page number gave me the first page of each set. Output the page number of those first-of-set pages and I have an index. (The recipient name could also be extracted from a predictable place in the XML, so the index was usable as a lookup or checklist.) Use those page numbers with a tool to extract pages from a pdf to another pdf, and I have broken up the huge document into individual one-destination sets. And set the high speed printer to offset each pdf as it comes out of the printer, and I've just made it so the humans stuffing the envelopes can almost keep up with the high speed printer! It cut the turnaround time from about five days employing a dozen people to about two days employing five people.
Lesson: to prevent garbage out, be very careful that what you feed in is not garbage.
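The indexing trick described above can be sketched roughly as follows. The comment does not show the actual pdftoxml output, so the XML layout here (`<page index="...">` elements holding `<text>` children), the "Page N" footer format, and the assumption that the recipient name is the first text on the page are all hypothetical stand-ins. The only idea taken from the comment is that a page without a printed page number starts a new set.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical layout: the real pdftoxml schema is not shown in the comment.
SAMPLE_XML = """
<document>
  <page index="1"><text>Jane Doe</text><text>Quarterly Report</text></page>
  <page index="2"><text>Holdings</text><text>Page 2</text></page>
  <page index="3"><text>Fees</text><text>Page 3</text></page>
  <page index="4"><text>John Roe</text><text>Quarterly Report</text></page>
  <page index="5"><text>Holdings</text><text>Page 2</text></page>
</document>
"""

# Assumed footer format for the printed page number.
PAGE_NUMBER = re.compile(r"^Page \d+$")


def build_index(xml_text):
    """Return a list of (first_page, last_page, recipient), one per set."""
    root = ET.fromstring(xml_text)
    starts = []      # (page index, recipient) for each first-of-set page
    last_page = 0
    for page in root.iter("page"):
        index = int(page.get("index"))
        last_page = index
        texts = [(t.text or "").strip() for t in page.iter("text")]
        # A page with no printed page number is treated as the start of a set.
        if not any(PAGE_NUMBER.match(t) for t in texts):
            recipient = texts[0] if texts else ""
            starts.append((index, recipient))
    sets = []
    for i, (first, recipient) in enumerate(starts):
        # Each set runs up to the page before the next set starts.
        end = starts[i + 1][0] - 1 if i + 1 < len(starts) else last_page
        sets.append((first, end, recipient))
    return sets


for first, last, recipient in build_index(SAMPLE_XML):
    print(f"{recipient}: pages {first}-{last}")
```

The resulting page ranges are what you would hand to whatever page-extraction tool splits the big PDF into one file per recipient.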
Admin
You mean two double quote chars, right?
Admin
Submitter here. My frist front page post!
You will not be surprised to learn there are multiple functions within this program that perform a pdf to txt conversion in slightly different ways. Sometimes PdfSharpCore is used to strip annotations, except when it isn't. Some document types are given unique (uuid) filenames, but other document types use the exact same filename every time. When the program crashes all the old text files are left in the temp directory. But hey, let's just check if a txt file exists. What could go wrong.
Thanks for asking! The fat client was up, but the Citrix version was down for a week. And I still don't know why the deploy behavior changed. My suspicion is related to a Visual Studio version change. As to why we use ClickOnce with Citrix, well, yeah.
Admin
Indeed. Matlab is one of the PDF types this program consumes. One of our users reported the following just this week:
Admin
Nope, I don't care how strict your engineering principles are. There's always a risk that something unexpected will happen in the production environment, and nobody will be happy if they have to work over a weekend that they weren't planning on working.
Admin
If you can't roll back a deployment with one simple command line or one click in whatever deployment toolchain web UI you are using, then there is something very wrong (or unprofessional) about your deployment pipeline ;-) As in nobody has to "come in for work" besides the ops team already on site.
Admin
huh? Roslyn. huh?
Admin
Please explain. I know what Roslyn was/is. Beyond that I can't tell if you're joking or serious, nor what your point might be. I'm not trying to be combative; I'm just genuinely baffled.
Admin
I have similar complaints about the escape character in C# - oh yes, let's use a directory delimiter. On the other hand, VB.Net has had ControlChars.Quote since.... the beginning?
Admin
Doubling the double-quote character is a very very very old convention in BASIC strings but was never universal (many BASIC variations did not support escaping of any kind). I'm fairly sure it predates Windows. It may even predate MS-DOS. It just might even predate C.
Admin
And what happens when a file name contains ".pdf" somewhere besides the end?
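To make the question concrete, here is a small sketch in Python rather than the program's own language. It assumes the output name is derived with a blanket string replace, which the question implies but the comments do not show. Both function names are made up for illustration.

```python
from pathlib import PurePath


def naive_txt_name(pdf_name):
    # Replaces every ".pdf" occurrence, not just the extension.
    return pdf_name.replace(".pdf", ".txt")


def suffix_txt_name(pdf_name):
    # Only swaps the final extension, and refuses anything that is not a PDF.
    path = PurePath(pdf_name)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a .pdf file name: {pdf_name!r}")
    return str(path.with_suffix(".txt"))


name = "report.pdf.final.pdf"
print(naive_txt_name(name))   # report.txt.final.txt
print(suffix_txt_name(name))  # report.pdf.final.txt
```

With the naive version, two differently named inputs such as `a.pdf.pdf` and `a.txt.pdf` both map to `a.txt.txt`, which matters if the program then only checks whether the text file already exists.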
Admin
I actually can't recall any BASIC version that doesn't use double quotes. Commodore 64 BASIC for sure had double quotes, so did Microsoft's Amiga BASIC. Obviously Q-BASIC too, and I can remember a lot of more exotic interpreters and compilers using double quotes as an escape character in strings for, well, double quotes. C-style escaping was more common with C-family languages like C, C++ and C#. Can't remember how it was done with the PASCAL family, COBOL and other niche languages specifically, but doubling string terminator characters was in general very common in the 80s and before.
Admin
"Double the quote character inside a string to represent an embedded quote character" is really quite a common method, used by SQL and by CSV files among others; I personally prefer this method over using an escape character, but it's a very trivial thing to get upset about. Only get upset with environments that don't give you either option (I'm looking at you, Informatica).
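The CSV case is easy to see with Python's standard `csv` module, which doubles embedded quote characters by default. This is only an illustration of the convention in CSV; the helper names are made up, and it says nothing about how the program under discussion builds its strings.

```python
import csv
import io


def to_csv_line(fields):
    # csv.writer quotes a field containing the quote character
    # and doubles each embedded quote (doublequote=True is the default).
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def from_csv_line(line):
    # The reader collapses each doubled quote back into a single one.
    return next(csv.reader(io.StringIO(line)))


line = to_csv_line(['say "hi"', "plain"])
print(repr(line))           # '"say ""hi""",plain\r\n'
print(from_csv_line(line))  # ['say "hi"', 'plain']
```

Note the three quotes in a row at the end of the first field: two of them are the embedded quote, and the third closes the field, which is the same arithmetic as in the string literal discussed earlier in the thread.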