The Daily WTF: Curious Perversions in Information Technology

2024-09-09

The number of web applications which need to generate PDFs is too damn high, in my experience.

Coincidentally enough, I just spent my Friday evening adding another one. Sorry.

I'm no expert in PDFs, but there's gotta be some way to just include space without spamming a bunch of paragraphs in the document.

I'm not usually an expert, but as my knowledge is at a local maximum due to Friday, you can add a bunch of vertical space between paragraphs with 0 distance Td. I don't, however, know how to do it in their PDF library or any PDF library, because I just wrote the PDF directly from PHP. And if that's TRWTF, then I don't care; my website, my masochistic development practices.
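
For illustration, here is a minimal sketch of that trick, in Python rather than PHP. It only builds the text part of a content stream, putting a `0 -gap Td` between paragraphs. The function names, the `/F1` font resource name and the coordinates are assumptions made up for this example, and a real file still needs the surrounding page, font and xref objects.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_block(lines, x=72, y=720, gap=120, font="/F1", size=12):
    """Build a BT...ET text object with `gap` points between lines.

    Td offsets are relative to the start of the current line, and the PDF
    y axis points up, so moving down the page needs a negative y offset.
    """
    ops = ["BT", f"{font} {size} Tf", f"{x} {y} Td"]
    for index, line in enumerate(lines):
        if index > 0:
            ops.append(f"0 {-gap} Td")  # zero horizontal move, `gap` points down
        ops.append(f"({escape_pdf_string(line)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


if __name__ == "__main__":
    print(text_block(["First paragraph", "Second (spaced out) paragraph"]))
```

The single `0 -120 Td` between the two `Tj` operators is all the "vertical space" there is; no empty paragraphs are involved.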

2024-09-09

The PDF spec doesn't have paragraphs or any other form of "markup". It's just a bunch of objects positioned and painted in some way. Adobe also made the spec deliberately and needlessly complex so it would be harder for potential competitors (and other software vendors) to implement their own parsers and renderers (probably also in order to sell more instances of their PDF library).

2024-09-09

EVERY variable is referenced in the two lines immediately following it...

2024-09-09

The terrible part is that PDF has no concept of a paragraph. PDF documents are an elaborate container for a very simple interpreter that executes drawing commands. The paragraph structures are just to keep the library happy -- they (shouldn't) translate into anything in the PDF. As Smithers mentioned, there is a concept of the current location in PDF, but there are straightforward ways of moving it that do not involve creating useless paragraph structures.
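
To make the "current location" idea concrete, here is a toy sketch in Python of how an interpreter might track it. It is an illustration under heavy assumptions: the `text_positions` name is invented, only `Td` and `Tj` are understood, there must be one operator per line, and escapes and nested parentheses are ignored. A real content stream needs a proper parser.

```python
def text_positions(content: str):
    """Return (x, y, text) for each Tj in a simplified content stream.

    The reported position is the start of the current text line, which is
    what Td offsets are measured from.
    """
    x = y = 0.0
    shown = []
    for raw in content.splitlines():
        line = raw.strip()
        if line == "BT":
            x = y = 0.0  # a new text object resets the text position
        elif line.endswith(" Td"):
            tx, ty, _ = line.split()
            x, y = x + float(tx), y + float(ty)  # move relative to line start
        elif line.endswith(" Tj"):
            text = line[: -len(" Tj")].strip()[1:-1]  # drop the outer parentheses
            shown.append((x, y, text))
    return shown


if __name__ == "__main__":
    stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n0 -120 Td\n(World) Tj\nET"
    for position in text_positions(stream):
        print(position)
```

Nothing in the stream says "paragraph"; the gap between the two strings exists only as the `0 -120 Td` move.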

2024-09-09

How you generate PDFs always depends heavily on the style of the library.

Maybe their library didn't have a set-position function, but who knows. The missing for loop is definitely a WTF.

2024-09-09

The number of web applications which need to generate PDFs is too damn high, in my experience.

And that will get worse. We are in an era of paperwork. The most influential creations of the 20th century are neither nuclear weapons nor space satellites, but the idea that reality fits on printed A4 sheets (and Excel files. The number of web applications which need to generate PDFs from Excel data is too damn high as well, in my experience).

2024-09-09

My guess is that the library at hand is generating the PDF by "typesetting" the paragraphs to PDF (I wouldn't be surprised if it also had (potentially rudimentary) html/markdown/... to PDF functionality).

2024-09-09

PM: We need some space between these darn paragraphs. Mike, this is your top-priority!

Mike: Ok.

(one month later)

PM: We need more space between those paragraphs. Sue, drop all your work and get on it!

Sue: Ok, boss.

(repeat until Godot arrives)

2024-09-09

IIRC PDF is derived from PostScript, which is a page description language where things are placed on the page starting at a specific location and not just left to right, top-down.

Jaime · 2024-09-09

Looks like this may be PDFSharp. If so, the entire block can be reduced to this line of code, applied to whatever paragraph is intended to come after the empty space:

nextParagraph.Format.SpaceBefore = Unit.FromPoint(120);

Gurth · 2024-09-09

PDF documents are an elaborate container for a very simple interpreter that executes drawing commands.

This is because:—

PDF is derived from PostScript

Off the top of my head, the basics of it are that they stripped out the loops and branching that PS allows, added compression, and called it PDF 1.0.

And indeed, PS just puts stuff at specified locations on the page, with text being put there as single lines rather than multi-line paragraphs. This is why it’s so hard to edit text in a PDF even in applications that can do this (like Acrobat Pro, but also Illustrator, and whatever else can): you can change the text on a single line just fine, but it will not affect any subsequent lines at all, and if the text is right-justified it will almost certainly throw that off as well.

Worf · 2024-09-09

Correct. PostScript is a programming language - it's Turing complete and can be used to write real programs in. It's way overkill for simply putting marks on a page. There are many demo programs you can send to a printer that would cause it to churn and then finally print the output.

Some demo programs include fractals, and there was a program that printed the Linux kernel module dependencies - you basically ran a script on the Linux kernel and it output a PostScript program that would read in each source file and generate the dependency chart from it. At the end, out comes a page.

Of course, it's also an excellent way to tie up the printer for many hours.

PDF was created as a subset of PostScript where it got rid of most of the Turing-completeness and kept just the page description part of the language so it could efficiently describe a page.

You have to remember fonts are basically miniature programs - a PostScript font is a program that draws and fills in the glyphs, so you tend to need the expressiveness of a Turing complete language. The TrueType font format is the same (and there have been many runtime-related TrueType font rendering security issues). PDF chooses not to embed the font directly and instead embeds the glyph output from the font. This way font licensing is simplified in that you cannot extract the embedded fonts from the PDF, but you could if it were a PostScript file.

It also greatly simplifies display composition - Solaris implemented DisplayPostScript, while macOS implements DisplayPDF.

2024-09-10

There are applications that can do it correctly, but unless you're in the prepress business and have to do this on a regular basis, the price is likely to put you off...

jeremypnet · 2024-09-10

"Write a program to extract the text out of these PDFs" I was told. Naturally I checked out the PDF specification and then pretty much immediately decided to find a library to do it. PDFBox was my choice because the application was in Java.

PDFBox was able to easily pull all the text objects out of the page, but imagine my shock when I found that they weren't in any reliable order and, depending on the PDF generator used, even text strings that visually formed part of a contiguous line weren't necessarily one string, in the right order or even near each other in the file. In the end, I extracted all the text objects and sorted them by their coordinates and then assumed that any objects with the same vertical coordinates were on the same line. This didn't always work, of course, but it was good enough for my purposes.
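
The sort-and-group step could look roughly like the sketch below, written in Python rather than the Java used with PDFBox. It is not the original code. The `(x, y, text)` fragment shape, the `fragments_to_lines` name, the `y_tolerance` parameter and joining fragments with a single space are all assumptions for illustration, and it assumes PDF user-space coordinates with the origin at the bottom left.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); hypothetical shape


def fragments_to_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Rebuild reading order from positioned text fragments."""
    # Top of the page first: PDF y grows upward, so sort y descending, then x.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    rows: List[Tuple[float, List[Fragment]]] = []
    for frag in ordered:
        # Join the current row if the y values are close enough, else start a new row.
        if rows and abs(rows[-1][0] - frag[1]) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append((frag[1], [frag]))
    # Within a row, order fragments left to right before joining them.
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda f: f[0]))
        for _, row in rows
    ]


if __name__ == "__main__":
    sample = [(100.0, 700.0, "world"), (72.0, 500.0, "second"),
              (72.0, 700.5, "hello"), (120.0, 500.0, "line")]
    print(fragments_to_lines(sample))
```

The tolerance is there because fragments on the same visual line do not always share an exact y value. Multi-column layouts and rotated text will still defeat this, which matches the "didn't always work" experience.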

Incidentally, it is perfectly legitimate for a PDF writer to render the text as an image and then encapsulate it in a PDF. Effectively the PDF has one command in it: "render this bitmap".

dkf · 2024-09-16

I remember dealing with a program that wrote PDFs that did it by completely reprogramming the font mapping every few characters. The text you saw on the screen had no relation to the text data in the file, and copying the data produced incomprehensible nonsense. There are some terrible sins out there.

Take a Line Break
