• (nodebb)

    It's worse than that, actually, since it will break anything that depends on the case of the attributes in a tag. Like, say, that goofy server over there that provides the images for <img src="bla-di-bla"> tags and provides different images depending on the case of the tag (yes, that's a WTF as well, but...).

    And @Remy, I hope your cat's problem is nothing too serious.

  • Ullli (unregistered)

    What about a base64 encoded image? [image]

  • (nodebb) in reply to Steve_The_Cynic

    Or perhaps more pointedly, it would turn [image] into [image] potentially breaking any such linking if the images are hosted on a case-sensitive filesystem.

  • Foo AKA Fooo (unregistered) in reply to Steve_The_Cynic

    URLs are case-sensitive in general (except for the scheme and host name), so that's not a server WTF, but indeed a breakage of the code. If that's not enough, the same would apply e.g. to alt tags which you probably don't want to lower-case.

    But of course, the main purpose of this code is to remind us why regexes are so evil. I mean, who could make any sense out of "s/(<.*?>)/\L$1/g" (untested) when we can have 46 lines of kind of readable code with just a few bugs, inefficiencies and misleading comments instead?

  • sorry to be pedantic (unregistered) in reply to Foo AKA Fooo

    Sorry to be pedantic, but "s/(<[^>]*?>)/\L$1/g"
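To make the breakage discussed above concrete, here is a minimal Python sketch of the same whole-tag lower-casing idea. It is an approximation, not the original one-liner: Python's `re` module has no `\L` escape, so a callback does the lower-casing, and the names `TAG`, `lowercase_whole_tags` and the sample markup are invented for illustration.

```python
import re

# Rough stand-in for s/(<[^>]*?>)/\L$1/g. Python's re has no \L escape,
# so each matched tag is lower-cased in a callback instead.
TAG = re.compile(r"<[^>]*?>")


def lowercase_whole_tags(html: str) -> str:
    """Lower-case everything between '<' and the next '>'."""
    return TAG.sub(lambda m: m.group(0).lower(), html)


# Hypothetical input: a mixed-case path and alt text inside the tag.
sample = '<IMG SRC="Images/Logo.PNG" ALT="Company Logo">'
result = lowercase_whole_tags(sample)
print(result)  # <img src="images/logo.png" alt="company logo">
```

The tag and attribute names come out lower-cased as intended, but so do the `src` path and the `alt` text, which is exactly the collateral damage described in the comments above.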

  • (author) in reply to Steve_The_Cynic

    Thanks, the news is not great and will require some extreme care, but the prognosis is good if we can get care started soon, which we should be able to do.

  • Jason Stringify (unregistered)

    As a JavaScript programmer I am triggered by the use of "let" as a variable name..

  • dpm (unregistered)

    Surely none of the "case" or regex or attributes is the main problem. TRWTF is this induhvidual using "continue"s instead of "else if"s.

  • Sauron (unregistered)

    The code will also break easily when there's an attribute value that contains a >

    Example: <div data-some-stuff=">">

    TRWTF is parsing HTML manually like that.

    They should have installed an HTML parser.

    Heck, if it's an enterprise project, their codebase probably has 5 different HTML parsers installed already!
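As a rough illustration of the "use a parser" suggestion, the sketch below uses Python's standard-library `html.parser` rather than whatever parser the original C# project might have had available. The class name `TagCollector` and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag with its attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names already lower-cased;
        # attribute values keep their original case.
        self.start_tags.append((tag, attrs))


collector = TagCollector()
collector.feed('<DIV Data-Some-Stuff=">" TITLE="Keep My Case">text</DIV>')
collector.close()
print(collector.start_tags)
```

The quoted `>` does not end the tag early, and only the names are normalized. Re-serializing the document from these events is left out of the sketch.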

  • (nodebb)

    And simply using pandoc would make the whole thing unneeded in the first place.

  • (nodebb) in reply to Athanasius

    @Athanasius that was more or less exactly my point.

    @Foo AKA Fooo I wasn't sure exactly how far the case-insensitivity goes in URLs, but if it's part of the standards that the path is case sensitive, then it's a good solid WTF just for downcasing the entire contents of the tag. The alt text is, at that point, just collateral damage.

    Also, re: regexes : "s/(<.*?>)/\L$1/g" (untested) is, indeed, not correct because it will downcase "THING" if a line of HTML contains <b>THING</b> (because .* is greedy)...

  • CodeMonkey403 (unregistered) in reply to Steve_The_Cynic

    @Steve The Cynic

    In the case of "s/(<.*?>)/\L$1/g" the ".*" is followed by a "?" which makes it non-greedy. Or whatever the correct term is.
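The greedy versus non-greedy distinction can be checked directly. This is a small Python sketch using the `<b>THING</b>` line from the comment above; the variable names are arbitrary.

```python
import re

line = "<b>THING</b>"

# Greedy: .* runs to the LAST '>' on the line, swallowing the text between tags.
greedy = re.findall(r"<.*>", line)

# Non-greedy (lazy): .*? stops at the FIRST '>' it can, so each tag matches separately.
lazy = re.findall(r"<.*?>", line)

print(greedy)  # ['<b>THING</b>']
print(lazy)    # ['<b>', '</b>']
```

With the trailing `?`, "THING" sits outside every match and would not be lower-cased.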

  • (nodebb) in reply to Remy Porter

    Sorry to hear about your cat, Remy. We recently got some not-good news about one of our cats. It's always difficult when this happens. I wish you and your furbaby all the very best <3

  • Xan (unregistered)

    It's possible that this is a pre-processor for an existing HTML-to-PDF tool such as Flying Saucer. Flying Saucer operates only on XHTML, as it's so much easier to parse. XHTML requires tags be in lower case.

  • (nodebb) in reply to CodeMonkey403

    Ugh. Thanks.

  • (nodebb)

    "If this code is necessary at all, it's because something they're using to parse HTML is itself broken, and expecting a specific casing for tags."

    Or a manager or customer who demands that tags be in lowercase for reasons.

    Aesthetic requirements being added in places where they weren't necessary for code to function but where some internal requirement or personal preference was driving the choice happened more than once on projects I was on back in the day.

  • Zimbu (unregistered) in reply to Sauron

    I was going to make the point that this breaks on ill-formed HTML, but yours is better because it breaks the same way on WELL-formed HTML.

  • Officer Johnny Holzkopf (unregistered) in reply to Remy Porter

    Get well soon! =^_^=

  • MaxiTB (unregistered)

    The article says it's a PDF to HTML converter, but the method is handling RTF (rich text). Or is it?

  • LZ79LRU (unregistered) in reply to MaxiTB

    This looks more like poor text to me. Or maybe desperate text.

  • Prime Mover (unregistered)

    Yeah, you get cats, you get used to the inevitable.

    Had to see one of my little furbabies over the Rainbow Bridge the other week.

  • Sou Eu (unregistered) in reply to Foo AKA Fooo

    HTML cannot be properly parsed with a REGEX. Some tags (notably br and hr) don't have closing tags, while others don't require one. Additionally, tags can be nested infinitely deep and attributes may have a less than or greater than symbol in their value.

    A company where I worked used a library from some big faceless org to convert a subset of HTML into a PDF. The idea was we would generate reports using this restrictive HTML subset and convert it to a PDF for the client.

  • (nodebb) in reply to Jason Stringify

    I stared at "char let" for like a minute wondering "what language is this?" before I realized it was just C# with a char variable named "let".

  • Foo AKA Fooo (unregistered) in reply to Sou Eu

    Please stop spreading this BS. We're not parsing HTML here, just recognizing tokens. (The same goes for many other languages, parsing is context-free or worse, but tokenizing is usually regular and often done with regex.)

    Nested tags or tags without closing tags have absolutely no bearing on recognizing and lower-casing the tags. The only thing relevant of what you say may be <> within quotes. That however is regular because quotes cannot be nested. It's true, my regex doesn't do that, but neither does the original code. I leave it as an exercise to the reader to add a few characters to the regex to make it do that, as opposed to probably a few hundred lines of code and another StringBuilder to the original code.
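One possible take on the exercise left to the reader, sketched in Python. This is not the commenter's regex: the pattern, the names `TAG`, `QUOTED` and `lowercase_tags`, and the choice to preserve quoted values are all assumptions made for illustration.

```python
import re

# A tag is '<', then any mix of unquoted characters and complete quoted
# strings, then '>'. Quotes cannot nest, so this stays regular.
TAG = re.compile(r"""<(?:[^<>"']|"[^"]*"|'[^']*')*>""")

# Used to split a matched tag into unquoted and quoted pieces.
QUOTED = re.compile(r"""("[^"]*"|'[^']*')""")


def _lower_outside_quotes(match: re.Match) -> str:
    # Splitting on a capturing group alternates: unquoted, quoted, unquoted, ...
    pieces = QUOTED.split(match.group(0))
    return "".join(p if i % 2 else p.lower() for i, p in enumerate(pieces))


def lowercase_tags(html: str) -> str:
    """Lower-case tag and attribute names, leaving quoted values alone."""
    return TAG.sub(_lower_outside_quotes, html)


sample = '<DIV DATA-X=">" CLASS="Foo">Hello <B>World</B></DIV>'
print(lowercase_tags(sample))
# <div data-x=">" class="Foo">Hello <b>World</b></div>
```

This sketch still lower-cases unquoted attribute values, and it knows nothing about raw `<` or `>` inside script or style content, so it remains a tokenizer-level trick rather than a replacement for a real parser.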

  • (nodebb) in reply to Foo AKA Fooo

    The only thing relevant of what you say may be <> within quotes.

    Nope. Raw < or > can appear in script and style tags.

    Also, HTML comments, but since they will not be part of the output, they don't matter too much.

  • I'm not a robot (unregistered) in reply to jeremypnet
    The only thing relevant of what you say may be <> within quotes.

    Nope. Raw < or > can appear in script and style tags.

    Nope. Sou Eu didn't say anything about < or > in script and style tags, therefore that isn't relevant to the statement "The only thing relevant of what you say ...".

  • --- (unregistered)

    TRWTF is rolling your own HTML to PDF converter instead of just printing to PDF from a (headless, if need be) browser.

Leave a comment on “A Case of Conversion”
