The Daily WTF: Curious Perversions in Information Technology

Steve_The_Cynic · 2023-08-09 Reply Admin

It's worse than that, actually, since it will break anything that depends on the case of the attributes in a tag. Like, say, that goofy server over there that provides the images for <img src="bla-di-bla"> tags and provides different images depending on the case of the tag (yes, that's a WTF as well, but...).

And @Remy, I hope your cat's problem is nothing too serious.

2023-08-09 Reply Admin

What about a base64 encoded image? [image]

Athanasius · 2023-08-09 Reply Admin

Or perhaps more pointedly, it would turn [image] into [image] potentially breaking any such linking if the images are hosted on a case-sensitive filesystem.

2023-08-09 Reply Admin

URLs are case-sensitive in general (except for the scheme and host name), so that's not a server WTF, but indeed a breakage of the code. If that's not enough, the same would apply e.g. to alt tags which you probably don't want to lower-case.
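To make the scheme/host distinction concrete, here is a minimal Python sketch. The helper name normalize_case is hypothetical and not from the article; it assumes only the standard library's urllib.parse and lower-cases just the parts that are case-insensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(url: str) -> str:
    """Lower-case only the scheme and host; leave path, query and fragment alone."""
    parts = urlsplit(url)
    # netloc may carry "user:password@" in front of the host; keep that part as-is.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_case("HTTP://Example.COM/Images/Cat.PNG?Size=Large"))
# prints: http://example.com/Images/Cat.PNG?Size=Large
```

The path and query keep their case, which is exactly what a blanket lower-casing of the whole tag destroys.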

But of course, the main purpose of this code is to remind us why regexes are so evil. I mean, who could make any sense out of "s/(<.*?>)/\L$1/g" (untested) when we can have 46 lines of kind of readable code with just a few bugs, inefficiencies and misleading comments instead?

2023-08-09 Reply Admin

Sorry to be pedantic, but "s/(<[^>]*?>)/\L$1/g"
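For readers who don't speak Perl-style substitutions, a rough Python equivalent might look like the sketch below. Python's re module has no \L, so a callback does the lower-casing; the function name lowercase_tags is made up for illustration. With the negated class [^>] the "?" no longer changes what is matched, so it is left out here.

```python
import re

# One "<", anything that is not ">", then ">".
TAG = re.compile(r"<[^>]*>")


def lowercase_tags(html: str) -> str:
    """Lower-case everything between < and >, like the one-liner above."""
    return TAG.sub(lambda m: m.group(0).lower(), html)


print(lowercase_tags('<B>THING</B> <IMG SRC="Images/Cat.PNG" ALT="My Cat">'))
# prints: <b>THING</b> <img src="images/cat.png" alt="my cat">
```

The text between the tags survives, but the src path and the alt text are lower-cased along with the tag name, which is the breakage described earlier in the thread.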

Remy Porter · 2023-08-09 Reply Admin

Thanks, the news is not great and will require some extreme care, but the prognosis is good if we can get care started soon, which we should be able to do.

2023-08-09 Reply Admin

As a JavaScript programmer I am triggered by the use of "let" as a variable name.

2023-08-09 Reply Admin

Surely none of the "case" or regex or attributes is the main problem. TRWTF is this induhvidual using "continue"s instead of "else if"s.

2023-08-09 Reply Admin

The code will also break easily when there's an attribute value that contains a >

Example: <div data-some-stuff=">">

TRWTF is parsing HTML manually like that.

They should have installed an HTML parser.

Heck, if it's an enterprise project, their codebase probably has 5 different HTML parsers installed already!
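As a sketch of the parser route, assuming Python and its standard-library html.parser rather than whatever the article's C# codebase has available: the parser hands over tag and attribute names already lower-cased, while attribute values keep their case. The class and function names are hypothetical, and this minimal version drops comments, doctypes and processing instructions.

```python
from html import escape
from html.parser import HTMLParser


class TagLowerer(HTMLParser):
    """Re-emit HTML with lower-cased tag and attribute names (illustrative only)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_open(self, tag, attrs, close=""):
        # Names arrive lower-cased; values arrive unescaped, so re-escape them.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}{close}>")

    def handle_starttag(self, tag, attrs):
        self._emit_open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, close=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def lowercase_tag_names(html: str) -> str:
    parser = TagLowerer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(lowercase_tag_names('<DIV Data-Some-Stuff=">"><IMG SRC="Images/Cat.PNG" ALT="My Cat"></DIV>'))
```

The ">" inside the attribute value no longer ends the tag early; it comes back out escaped as &gt;, and the src and alt values are left untouched.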

Allexxann · 2023-08-09 Reply Admin

And simply using pandoc would make the whole thing unneeded in the first place.

Steve_The_Cynic · 2023-08-09 Reply Admin

@Athanasius that was more or less exactly my point.

@Foo AKA Fooo I wasn't sure exactly how far the case-insensitivity goes in URLs, but if it's part of the standards that the path is case sensitive, then it's a good solid WTF just for downcasing the entire contents of the tag. The alt text is, at that point, just collateral damage.

Also, re: regexes: "s/(<.*?>)/\L$1/g" (untested) is, indeed, not correct because it will downcase "THING" if a line of HTML contains <b>THING</b> (because .* is greedy)...

2023-08-09 Reply Admin

@Steve The Cynic

In the case of "s/(<.*?>)/\L$1/g" the ".*" is followed by a "?" which makes it non-greedy. Or whatever the correct term is.
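A quick way to see the difference, sketched in Python (whose re module uses the same ".*" versus ".*?" semantics as Perl for this case):

```python
import re

line = "<b>THING</b>"

# Greedy: ".*" grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", line)

# Non-greedy (also called lazy): ".*?" stops at the first ">" it can.
lazy = re.findall(r"<.*?>", line)

print(greedy)  # ['<b>THING</b>']
print(lazy)    # ['<b>', '</b>']
```

So only the greedy form would swallow "THING"; the non-greedy form matches the two tags separately and leaves the text between them alone.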

The Beast in Black · 2023-08-09 Reply Admin

Sorry to hear about your cat, Remy. We recently got some not-good news about one of our cats. It's always difficult when this happens. I wish you and your furbaby all the very best <3

2023-08-09 Reply Admin

It's possible that this is a pre-processor for an existing HTML-to-PDF tool such as Flying Saucer. Flying Saucer operates only on XHTML as it's so much easier to parse. XHTML requires tags be in lower case.

Steve_The_Cynic · 2023-08-09 Reply Admin

Ugh. Thanks.

Jer · 2023-08-09 Reply Admin

"If this code is necessary at all, it's because something they're using to parse HTML is itself broken, and expecting a specific casing for tags."

Or a manager or customer who demands that tags be in lowercase for reasons.

Aesthetic requirements being added in places where they weren't necessary for code to function but where some internal requirement or personal preference was driving the choice happened more than once on projects I was on back in the day.

2023-08-09 Reply Admin

I was going to make the point that this breaks on ill-formed HTML, but yours is better because it breaks the same way on WELL-formed HTML.

2023-08-09 Reply Admin

Get well soon! =^_^=

2023-08-10 Reply Admin

The article says it's a PDF to HTML converter, but the method is handling RTF (rich text). Or is it?

2023-08-10 Reply Admin

This looks more like poor text to me. Or maybe desperate text.

2023-08-10 Reply Admin

Yeah, you get cats, you get used to the inevitable.

Had to see one of my little furbabies over the Rainbow Bridge the other week.

2023-08-10 Reply Admin

HTML cannot be properly parsed with a REGEX. Some tags (notably br and hr) don't have closing tags, while others don't require one. Additionally, tags can be nested infinitely deep and attributes may have a less than or greater than symbol in their value.

A company where I worked used a library from some big faceless org to convert a subset of HTML into a PDF. The idea was we would generate reports using this restrictive HTML subset and convert it to a PDF for the client.

Matt Hamilton · 2023-08-10 Reply Admin

I stared at "char let" for like a minute wondering "what language is this?" before I realized it was just C# with a char variable named "let".

2023-08-11 Reply Admin

Please stop spreading this BS. We're not parsing HTML here, just recognizing tokens. (The same goes for many other languages, parsing is context-free or worse, but tokenizing is usually regular and often done with regex.)

Nested tags or tags without closing tags have absolutely no bearing on recognizing and lower-casing the tags. The only thing relevant of what you say may be <> within quotes. That however is regular because quotes cannot be nested. It's true, my regex doesn't do that, but neither does the original code. I leave it as an exercise to the reader to add a few characters to the regex to make it do that, as opposed to probably a few hundred lines of code and another StringBuilder to the original code.
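One possible take on that exercise, as a Python sketch rather than the commenter's own answer: let the tag token contain either ordinary characters or whole quoted strings, so a ">" inside quotes no longer ends the token. The sample input is invented for illustration.

```python
import re

# Naive token: stops at the first ">" even if it sits inside a quoted value.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware token: plain characters, or a complete "..." or '...' string.
QUOTED_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

html = '<A TITLE="X > Y" HREF="Page.HTML">Link</A>'

print(NAIVE_TAG.findall(html))
# ['<A TITLE="X >', '</A>']
print(QUOTED_TAG.findall(html))
# ['<A TITLE="X > Y" HREF="Page.HTML">', '</A>']
```

This only fixes the tokenizing. Lower-casing the whole token would still lower-case the attribute values, just as the shorter regexes in the thread do.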

2023-08-12 Reply Admin

I finally realized how one can get a variable named "let". It's short for "letter" - because in parsing HTML, one expects every character to be a letter!

jeremypnet · 2023-08-15 Reply Admin

"The only thing relevant of what you say may be <> within quotes."

Nope. Raw < or > can appear in script and style tags.

Also, HTML comments, but since they will not be part of the output, they don't matter too much.
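A small Python sketch of that failure mode, using a quote-aware tag pattern and an invented script snippet: the comparison operators inside the script look like a tag to the regex.

```python
import re

# Quote-aware tag token: plain characters, or a complete quoted string.
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

html = "<SCRIPT>if (a<B && C>d) run();</SCRIPT>"

tokens = TAG.findall(html)
print(tokens)
# ['<SCRIPT>', '<B && C>', '</SCRIPT>']

lowered = TAG.sub(lambda m: m.group(0).lower(), html)
print(lowered)
# <script>if (a<b && c>d) run();</script>
```

The variables B and C come out as b and c, so the script's meaning changes even though no quotes are involved.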

2023-08-17 Reply Admin

"The only thing relevant of what you say may be <> within quotes."

"Nope. Raw < or > can appear in script and style tags."

Nope. Sou Eu didn't say anything about < or > in script and style tags, therefore that isn't relevant to the statement "The only thing relevant of what you say ...".

2023-08-23 Reply Admin

TRWTF is rolling your own HTML to PDF converter instead of just printing to PDF from a (headless, if need be) browser.

A Case of Conversion
