Admin
It's worse than that, actually, since it will break anything that depends on the case of the attributes in a tag. Like, say, that goofy server over there that provides the images for <img src="bla-di-bla"> tags and provides different images depending on the case of the tag (yes, that's a WTF as well, but...).
And @Remy, I hope your cat's problem is nothing too serious.
Admin
What about a base64 encoded image? [image]
Admin
Or perhaps more pointedly, it would turn [image] into [image] potentially breaking any such linking if the images are hosted on a case-sensitive filesystem.
Admin
URLs are case-sensitive in general (except for the scheme and host name), so that's not a server WTF, but indeed a breakage of the code. If that's not enough, the same would apply e.g. to alt attributes, which you probably don't want to lower-case.
But of course, the main purpose of this code is to remind us why regexes are so evil. I mean, who could make any sense out of "s/(<.*?>)/\L$1/g" (untested) when we can have 46 lines of kind of readable code with just a few bugs, inefficiencies and misleading comments instead?
Admin
Sorry to be pedantic, but "s/(<[^>]*?>)/\L$1/g"
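For anyone who wants to see the breakage discussed above, here is a minimal Python sketch of the same idea as the Perl-style substitution. It is not the article's code; the function name and the sample tag are made up for illustration. It lower-cases everything between "<" and ">", attribute values included.

```python
import re

# Rough Python equivalent of s/(<[^>]*?>)/\L$1/g:
# lower-case every "<...>" run, attribute values and all.
TAG = re.compile(r"<[^>]*>")

def lowercase_tags_naively(html: str) -> str:
    """Hypothetical helper: lower-cases whole tags, not just tag names."""
    return TAG.sub(lambda m: m.group(0).lower(), html)

# Hypothetical sample with a case-sensitive path and alt text.
sample = '<IMG SRC="Images/Cat.PNG" ALT="My Cat">'
result = lowercase_tags_naively(sample)
print(result)
```

The path and the alt text come out lower-cased along with the tag name, which is exactly the problem on a case-sensitive host.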
Admin
Thanks, the news is not great and will require some extreme care, but the prognosis is good if we can get care started soon- which we should be able to do.
Admin
As a JavaScript programmer I am triggered by the use of "let" as a variable name..
Admin
Surely none of the "case" or regex or attributes is the main problem. TRWTF is this induhvidual using "continue"s instead of "else if"s.
Admin
The code will also break easily when there's an attribute value that contains a ">". Example: <div data-some-stuff=">">
TRWTF is parsing HTML manually like that. They should have installed an HTML parser.
Heck, if it's an enterprise project, their codebase probably has 5 different HTML parsers installed already!
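As a rough illustration of the "use a parser" point, here is a small sketch with Python's standard-library html.parser (the original code is C#, so this is only an analogy, and the TagCollector class is a made-up name). The parser copes with a ">" inside a quoted attribute value and reports tag and attribute names in lower case while leaving the value alone.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names already lower-cased;
        # attribute values keep their original case.
        self.start_tags.append((tag, attrs))

collector = TagCollector()
collector.feed('<DIV Data-Some-Stuff=">">text</DIV>')
collector.close()
print(collector.start_tags)
```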
Admin
And simply using pandoc would make the whole thing unneeded in the first place.
Admin
@Athanasius that was more or less exactly my point.
@Foo AKA Fooo I wasn't sure exactly how far the case-insensitivity goes in URLs, but if it's part of the standards that the path is case sensitive, then it's a good solid WTF just for downcasing the entire contents of the tag. The alt text is, at that point, just collateral damage.
Also, re: regexes:
"s/(<.*?>)/\L$1/g" (untested) is, indeed, not correct because it will downcase "THING" if a line of HTML contains <b>THING</b> (because .* is greedy...
Admin
@Steve The Cynic
In the case of "s/(<.*?>)/\L$1/g" the ".*" is followed by a "?" which makes it non-greedy. Or whatever the correct term is.
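The greedy versus non-greedy difference is easy to check. A minimal sketch in Python, whose re module treats ".*" and ".*?" the same way Perl does here (the helper name is made up for illustration):

```python
import re

def lower_matches(pattern: str, text: str) -> str:
    """Hypothetical helper: lower-case whatever the pattern matches."""
    return re.sub(pattern, lambda m: m.group(0).lower(), text)

line = "<B>THING</B>"

# Greedy: ".*" runs to the LAST ">" on the line, so the text between
# the tags is swallowed into a single match.
greedy = lower_matches(r"<.*>", line)

# Non-greedy (lazy): ".*?" stops at the FIRST ">", so each tag is
# matched separately and the text in between is left alone.
lazy = lower_matches(r"<.*?>", line)

print(greedy)
print(lazy)
```

Only the greedy version downcases "THING"; the version with the "?" leaves it untouched.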
Admin
Sorry to hear about your cat, Remy. We recently got some not-good news about one of our cats. It's always difficult when this happens. I wish you and your furbaby all the very best <3
Admin
It's possible that this is a pre-processor for an existing HTML-to-PDF tool such as Flying Saucer. Flying Saucer operates only on XHTML, as it's so much easier to parse. XHTML requires tags be in lower case.
Admin
Ugh. Thanks.
Admin
"If this code is necessary at all, it's because something they're using to parse HTML is itself broken, and expecting a specific casing for tags."
Or a manager or customer who demands that tags be in lowercase for reasons.
Aesthetic requirements being added in places where they weren't necessary for code to function but where some internal requirement or personal preference was driving the choice happened more than once on projects I was on back in the day.
Admin
I was going to make the point that this breaks on ill-formed HTML, but yours is better because it breaks the same way on WELL-formed HTML.
Admin
Get well soon! =^_^=
Admin
The article says it's a PDF to HTML converter, but the method is handling RTF (rich text). Or is it?
Admin
This looks more like poor text to me. Or maybe desperate text.
Admin
Yeah, you get cats, you get used to the inevitable.
Had to see one of my little furbabies over the Rainbow Bridge the other week.
Admin
HTML cannot be properly parsed with a REGEX. Some tags (notably br and hr) don't have closing tags, while others don't require one. Additionally, tags can be nested infinitely deep and attributes may have a less than or greater than symbol in their value.
A company where I worked used a library from some big faceless org to convert a subset of HTML into a PDF. The idea was we would generate reports using this restrictive HTML subset and convert it to a PDF for the client.
Admin
I stared at "char let" for like a minute wondering "what language is this?" before I realized it was just C# with a char variable named "let".
Admin
Please stop spreading this BS. We're not parsing HTML here, just recognizing tokens. (The same goes for many other languages, parsing is context-free or worse, but tokenizing is usually regular and often done with regex.)
Nested tags or tags without closing tags have absolutely no bearing on recognizing and lower-casing the tags. The only thing relevant of what you say may be <> within quotes. That however is regular because quotes cannot be nested. It's true, my regex doesn't do that, but neither does the original code. I leave it as an exercise to the reader to add a few characters to the regex to make it do that, as opposed to probably a few hundred lines of code and another StringBuilder to the original code.
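One possible take on that exercise, sketched in Python rather than Perl. This is an assumption about what "a few characters" might look like, not the commenter's own answer. The pattern skips over quoted attribute values, so a ">" inside quotes no longer ends the tag, and only the tag name is lower-cased, so URLs and alt text survive. Attribute names are left as written, and raw script or style content and comments are not handled.

```python
import re

# A tag is "<", an optional "/", a name, then any mix of
# non-quote characters and complete quoted strings, then ">".
TAG = re.compile(
    r"""<(/?)([A-Za-z][A-Za-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>"""
)

def lowercase_tag_names(html: str) -> str:
    """Hypothetical helper: lower-case tag names only, keep the rest."""
    return TAG.sub(
        lambda m: "<" + m.group(1) + m.group(2).lower() + m.group(3) + ">",
        html,
    )

print(lowercase_tag_names('<DIV data-some-stuff=">">Text</DIV>'))
print(lowercase_tag_names('<IMG SRC="Images/Cat.PNG">'))
```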
Admin
Nope. Raw < or > can appear in script and style tags.
Also, HTML comments, but since they will not be part of the output, they don't matter too much.
Admin
Nope. Sou Eu didn't say anything about < or > in script and style tags, therefore that isn't relevant to the statement "The only thing relevant of what you say ...".
Admin
TRWTF is rolling your own HTML to PDF converter instead of just printing to PDF from a (headless, if need be) browser.