• chris (unregistered)

    i like that he replaces html with head and body.

  • SomeoneElse (unregistered)

    I also love how it catches an exception, only to throw it again. Code like this always makes me feel better about my code.

  • Cyberwizzard (unregistered)

    First!

    Omfg, that's not even funny... How can someone who obviously never heard of toLowerCase() (or its equivalents) write an HTML parser... I'd love to see what that thing does when I want to see its DOM tree... it does have a DOM tree, right?.... I feel a headache coming up... sigh

  • darren (unregistered)

    wow. Maybe I'm just lazy, but after thirty lines of this, you just have to be tempted to find a better way.

  • Stof (unregistered)

    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

  • IV (unregistered)

    Does someone want to explain why HTML goes to body? And why every other version of html with an uppercase letter, or two, or three, goes to something else? I can actually understand why the rest of it is there, if some intern started by just lowercasing everything, and someone said "remember just to lowercase the tags". Sometimes people don't think when faced with a situation like that. I also wonder if they remembered every tag and if they typed it all manually.

    Captcha: stinky

  • Jimmie (unregistered)

    I'm betting that if he's expecting <hTml> as a tag, there's no way in hell that the HTML is valid XML syntax. Thus an XML parser is just going to barf all over.

    in this scenario, one simple RegEx should do the trick

  • teqman (unregistered) in reply to Stof

    Regexps are fine for changing the case of tag names; it's not that hard to write a proper regexp to isolate elements of a well-formed HTML document.

  • anon (unregistered) in reply to Stof
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

  • Tarwn (unregistered) in reply to SomeoneElse
    I also love how it catches an exception, only to throw it again. Code like this always makes me feel better about my code.

    I love it when programmers can't handle concepts like bubbling and logical separation. Working on their code after the fact always proves painful but financially rewarding.

    Of course in this situation he's still eating the stack trace by explicitly re-throwing the exception... ah well. It wouldn't be a WTF without a few extras here and there. :P

  • God (unregistered)

    Interesting, he changes html to body :Þ

  • JAPH (unregistered)
    anon:
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    don't be ridiculous, scrabble doesn't have punctuation

  • IceFreak2000 (cs)
    anon:
    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    You are kidding, right? I know they can become unwieldy at times (the famous multi-line email expression springs to mind), but if you spend a small amount of time with regular expressions, they quickly become much more than mere 'line noise'.

    Oh, and I'm a C# developer, and I've not touched Perl in years!

  • RON (unregistered) in reply to anon
    anon:
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    Fail.

  • Bob Janova (cs)

    I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?

  • Bob (cs) in reply to JAPH
    JAPH:
    anon:
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    don't be ridiculous, scrabble doesn't have punctuation

    For perl all tiles should have punctuation (at least one item) printed on each blank face, then follow the above procedure...

    Mind you - I can't talk, I've just been scripting in bash as well... yuk

    Line noise expressions can be fun if well commented - pity that by the time you've commented it the darned input has changed so it all crashes...

  • Welbog (cs) in reply to Bob Janova
    Bob Janova:
    I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?
    The idea is you can easily match elements using a simple regular expression and then call toLowerCase() on what is matched. Just a reluctant "<.+>" would do the trick to match any element.

    You'd have to worry about attributes, though, assuming you want to leave attribute value case alone.
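
A minimal Python sketch of the approach Welbog describes, assuming Python's `re` module stands in for whatever regex engine is at hand. The function name `lowercase_tags` and the sample markup are made up for illustration. Note that it lower-cases everything between `<` and `>`, attribute values included, which is exactly the caveat raised above.

```python
import re

# Reluctant (non-greedy) pattern: '<', then as few characters as possible, then '>'.
TAG = re.compile(r"<.+?>")


def lowercase_tags(html: str) -> str:
    """Lower-case every <...> span; attribute values are lower-cased too."""
    return TAG.sub(lambda m: m.group(0).lower(), html)


print(lowercase_tags('<HTML><BoDy CLASS="Main">Hello World</BoDy></HTML>'))
# prints: <html><body class="main">Hello World</body></html>
```

The text between tags keeps its case, but `CLASS="Main"` becomes `class="main"`, so this only suits input where attribute value case does not matter.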

  • Dan (unregistered)
    anon:
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    I think this is by far the biggest WTF out of this whole topic. Once I learned regexps, I can't imagine the clunky "WTF" type code I'd be writing if I didn't have them.

  • Stof (unregistered) in reply to Welbog

    Won't work by default. Regexps are greedy and such a pattern will match all the text between the first < and the last >

  • Welbog (cs) in reply to Stof
    Stof:
    Won't work by default. Regexps are greedy and such a pattern will match all the text between the first < and the last >
    I take it you missed it when I said reluctant.
  • Chris D (unregistered)

    I also think a few people are missing that this isn't the HTML parser. This is just a hack preprocessor for what we can only assume is an awful HTML parser.

    Regexs are difficult to use correctly for parsing HTML because of the greedy nature of most regexs. It's not impossible .. but hardly the easiest way.

  • line (unregistered)

    That's why he said "reluctant". <.+?> should do the trick in most cases, but that's not a standard.
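
The greedy versus reluctant difference discussed here can be seen in a short Python sketch. The sample string is invented for illustration, and Python's `re` module is only one engine that supports the `+?` quantifier; engines limited to POSIX syntax generally do not.

```python
import re

sample = "<B>bold</B> and <I>italic</I>"

# Greedy: '.+' grabs as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", sample)

# Reluctant: '.+?' stops at the first '>' it can, giving one match per tag.
reluctant = re.findall(r"<.+?>", sample)

print(greedy)     # ['<B>bold</B> and <I>italic</I>']
print(reluctant)  # ['<B>', '</B>', '<I>', '</I>']
```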

  • Joseph Newton (unregistered) in reply to anon
    anon:
    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    Well, no. There are regular expressions, and there are unreadable regular expressions. Like any programming construct, a regexp can be misused. The ones that try to do everything in one line tend to be pretty ugly--and incredibly inefficient. A well-constructed regexp, one that handles a single issue, can be very efficient and not that hard to follow.

    anon:
    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)
    Again, there are Perl hacks, and there are those who use Perl as a tool in well-structured programming. The language does provide structures that support structured programming, but gives its users all the ammunition they need for shooting themselves in the foot.
  • Stof (unregistered) in reply to Welbog
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)
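
A short Python sketch of what happens with that tag. The first pattern is the reluctant one from earlier in the thread; the second is a hypothetical quote-aware variant, written here only to show how much extra care a single attribute demands. It is not a complete answer: comments, scripts, and malformed quoting are all ignored.

```python
import re

tag = '<tag value="value with a > just to mess with regexp parsing mechanisms">'

# The reluctant pattern stops at the first '>', which sits inside the attribute value.
naive = re.match(r"<.+?>", tag).group(0)

# A quote-aware pattern: outside quotes any character except > " ' is allowed;
# a quoted string is consumed as a unit, so a '>' inside it is skipped over.
quote_aware = re.match(r"""<(?:[^>"']|"[^"]*"|'[^']*')+>""", tag).group(0)

print(naive)        # <tag value="value with a >
print(quote_aware == tag)  # True
```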

  • Welbog (cs) in reply to Stof
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    You're right about that one, yeah. Like I said, special consideration is needed for the attributes. I would tend to agree with you that those special considerations shouldn't be handled with regular expressions.

  • Dan (unregistered) in reply to Stof
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    The regexp "parsing mechanisms" are only as restrictive as the programmer makes them...

  • JL (unregistered) in reply to Tarwn
    Tarwn:
    Of course in this situation he's still eating the stack trace by explicitly re-throwing the exception... ah well. It wouldn't be a WTF without a few extras here and there. :P
    Agreed. I've inherited a bunch of code where the original coder wrapped *every function* with an exception handler containing only "throw ex". Easy enough to fix, but so pointless. So, a PSA:

    Many VB.NET programmers don't seem to know this, but when you use "throw ex", it sets the exception's stack trace to that line. So if you are rethrowing an exception that you caught, "throw ex" will overwrite the exception's original location, making it much harder to determine where the real problem is. Instead, you should either wrap the exception in a new exception and throw that, or just use the "throw" keyword on its own, which rethrows the exception without resetting its stack trace.
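
The two recommended options (rethrow bare, or wrap and throw) can be sketched in Python as a loose analogy. This is not VB.NET, and Python's traceback rules differ from the .NET behavior described above; the names `parse`, `PreprocessError`, and `innermost_frame` are hypothetical and exist only for this illustration.

```python
import traceback


class PreprocessError(Exception):
    """Hypothetical wrapper exception for this sketch."""


def parse(text):
    raise ValueError("bad input")  # the real fault is here


def rethrow_bare(text):
    try:
        parse(text)
    except ValueError:
        raise  # bare re-raise: the original traceback is kept


def rethrow_wrapped(text):
    try:
        parse(text)
    except ValueError as ex:
        # Wrap and chain: the original exception stays reachable via __cause__.
        raise PreprocessError("could not preprocess") from ex


def innermost_frame(exc):
    """Name of the function where the exception was actually raised."""
    return traceback.extract_tb(exc.__traceback__)[-1].name


try:
    rethrow_bare("x")
except ValueError as e:
    print(innermost_frame(e))  # parse

try:
    rethrow_wrapped("x")
except PreprocessError as e:
    print(innermost_frame(e.__cause__))  # parse
```

In both cases the location of the original fault survives, which is the property the PSA is asking for.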

  • skington (cs) in reply to Stof

    Given that this is a pre-processor of some snippet (or sub-set) of HTML that is required because whatever comes after it can't handle upper- or mixed-case tags, I think there are more important things to do than make sure the lower-casing regex doesn't choke on tags with unusual attributes ;-).

    Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.

  • Marcin (unregistered)

    The real WTF is attempting to "preprocess" a flat text document. This code should not exist.

    captcha: quake

  • Hit (unregistered)

    I noticed the poster didn't include the "fix" regular expression. I can only suppose this is because it would lead to 4000 comments detailing all the cases it would miss.

    This is one of those problems which looks quite easy to deal with at first, but then turns out to be quite difficult when you dig into it. HTML is probably the loosest offshoot of the SGML languages, and parsing it to handle all the cases correctly is far from trivial.

  • musigenesis (cs)

    My favorite thing is that he passes in the string by reference (which is a good thing to do because it's probably a pretty long string, and no point in making an extra copy of it), and then tosses that tiny bit of efficiency away with all the "s = s.Replace(something, somethingelse)" lines.

  • Strider (unregistered) in reply to skington

    < [A-Za-z0-9]+) I don't get it, what's the parenthesis for...and why the leading space? How about

    <.[^ ]*

    i.e., if you find an open bracket, ignore the first character and match all the way till you find a space...that's what I'm trying to do, don't know if this regex is what I just said...(not a regex pro)

  • Stof (unregistered) in reply to skington
    skington:
    Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.
    I see what you mean. It would probably work and only "corrupt" slightly things like : [image]

    In fact, it'll only corrupt the case there. Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

    Try to write a regexp to match the content of a tag inside a

  • skington (cs) in reply to Strider
    Strider:
    < [A-Za-z0-9]+) I don't get it, what's the parenthesis for...and why the leading space? How about

    <.[^ ]*

    i.e., if you find an open bracket, ignore the first character and match all the way till you find a space...that's what I'm trying to do, don't know if this regex is what I just said...(not a regex pro)

    Meh, was meant to be < ([A-Za-z0-9]+), i.e. capture the taggish-looking thing you matched, so you can call toLower or lc or whatever the function that lower-cases strings is in your language. And the space because the emerging standard in Perl is to use extended-format regexes that ignore white space, to try and make them more readable.

    Your regex won't match e.g.
    - no space.
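
A Python sketch of the capture-and-lower-case idea, using `re.VERBOSE` as a rough counterpart to Perl's extended format. The optional `/` for closing tags is an addition not in skington's pattern, and the sample markup is invented for illustration.

```python
import re

# Verbose mode ignores whitespace inside the pattern, much like Perl's /x.
TAG_NAME = re.compile(r"""
    < (/?)            # opening bracket, plus an optional slash (an addition)
    ([A-Za-z0-9]+)    # the tag name, captured so it can be lower-cased
""", re.VERBOSE)


def lowercase_tag_names(html: str) -> str:
    """Lower-case only the name that follows '<' or '</'."""
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + m.group(2).lower(), html)


print(lowercase_tag_names('<IMG SRC="Pic.PNG" TITLE="A <B> C"></IMG>'))
# prints: <img SRC="Pic.PNG" TITLE="A <b> C"></img>
```

The match stops at the space, so attribute names and most of the value keep their case. The `<B` inside the `TITLE` value is still lower-cased, because the pattern has no idea it is inside a quoted string.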

  • ObiWayneKenobi (cs)

    Oh good sweet god. Love how it replaces the tags.

    And.. I don't get the deal with the Exceptions. It's bad to throw an exception in a try/catch block? But it's even worse to go: throw new Exception? So... what the hell do you do with the exception? I mean, isn't the proper error handling to throw it, so the calling method (in the presentation layer) can also implement try/catch and then display the error? You can't display the error in the business logic or data logic layers, so what more can you do except throw the exception?

  • Welbog (cs) in reply to Stof
    Stof:
    skington:
    Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.
    I see what you mean. It would probably work and only "corrupt" slightly things like : [image]

    In fact, it'll only corrupt the case there. Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

    Try to write a regexp to match the content of a tag inside a

    Indeed. The best way to go about normalizing a document like this isn't to pick out elements individually, but rather to start at the top of the document and work through it token by token (if not character by character), keeping track of whether the current character is an element (i.e. encountered a < but not > yet), and whether the current character is an attribute value (i.e. encountered a =" but not " yet), and changing the case accordingly. This way only requires one loop through the string, and only requires in-place changes to be made to the string (so if you're using a char[] instead of an immutable string you don't have to remake the string for each replace).
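
The single-pass scan described above might look like this in Python. It is a minimal sketch under stated assumptions: only quoted attribute values are protected, and comments, scripts, and unquoted values get no special treatment. Python strings are immutable, so instead of changing a `char[]` in place it collects characters in a list and joins once at the end. The function name `lowercase_markup` is made up for illustration.

```python
def lowercase_markup(html: str) -> str:
    """Lower-case tag and attribute names in one pass, leaving quoted values alone."""
    out = []
    in_tag = False   # seen '<' but not the closing '>' yet
    quote = None     # quote character of the attribute value we are inside, if any
    for ch in html:
        if quote is not None:
            # Inside an attribute value: copy verbatim until the matching quote.
            if ch == quote:
                quote = None
            out.append(ch)
        elif in_tag:
            if ch in "\"'":
                quote = ch          # an attribute value starts here
                out.append(ch)
            elif ch == ">":
                in_tag = False      # end of the element
                out.append(ch)
            else:
                out.append(ch.lower())
        else:
            if ch == "<":
                in_tag = True
            out.append(ch)          # text outside tags is left untouched
    return "".join(out)


print(lowercase_markup('<BoDy TITLE="A > B">Some TEXT</BoDy>'))
# prints: <body title="A > B">Some TEXT</body>
```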

    But I like regular expressions. I'll use them where I can but here is obviously not the place for them.

  • skington (cs) in reply to Stof
    Stof:
    skington:
    Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.
    I see what you mean. It would probably work and only "corrupt" slightly things like : [image]

    In fact, it'll only corrupt the case there.

    Nope - it will stop at the space (which isn't part of the [A-Za-z0-9] match), turning [image] into [image]. Assuming that the replace part of your regex lower-cases the stuff that was matched.

    Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

    Try to write a regexp to match the content of a tag inside a

    With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.

  • Hans (unregistered)

    AFAICS, the best thing is that this case-conversion function catches "ex", and then throws "Ex"..

    God, I am lucky that I work with the programming languages of my own choice! :-)

  • grg (unregistered)

    Ahem, I haven't seen anybody mention this so far, but IMHO the bigger WTF is thinking you can parse HTML with regular expressions.

    That's basically impossible.

    You see, regular expressions are cryptic, but REGULAR.

    And HTML isn't.

    For example you can embed almost anything into an HTML document-- JavaScript, PHP, and worse. Each one of those languages has its own syntax. Worse yet, that code tends to have quite a few HTML fragments in quotes of various kinds.

    To correctly parse HTML you need a lexical scanner that goes through the HTML from top to bottom, interpreting every token it sees, and doing the right thing depending on context.

    Regular expressions just don't give that kind of flexibility. At least not as they're commonly used.

    I suppose you could scan the input one character at a time with a regular expression, but that would not be in any way fast, or clean, or probably reliable.
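
For comparison, here is what a purpose-built scanner gives you, sketched with `html.parser.HTMLParser` from Python's standard library. That parser reports tag and attribute names already lower-cased and handles a `>` inside a quoted attribute value. The `TagCollector` class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the events the parser reports while scanning the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # 'tag' and the attribute names arrive lower-cased; values keep their case.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


collector = TagCollector()
collector.feed('<BoDy TITLE="A > B">Some TEXT</BoDy>')
collector.close()

for event in collector.events:
    print(event)
```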

  • brian (unregistered)
    Comment held for moderation.
  • JR (unregistered)

    This gave me an instant headache.

    Captcha: doom (very appropriate)

  • YourCoke (unregistered)

    Well, content.lower() is much easier, I think. It's time to switch to a nice language like python or learn the power of reg. expressions. Maybe both.

  • joe (unregistered) in reply to brian
    Comment held for moderation.
  • JL (unregistered) in reply to musigenesis
    musigenesis:
    My favorite thing is that he passes in the string by reference (which is a good thing to do because it's probably a pretty long string, and no point in making an extra copy of it), and then tosses that tiny bit of efficiency away with all the "s = s.Replace(something, somethingelse)" lines.
    Actually, String is a reference type in .NET, so it's actually marginally less efficient to use the ByRef keyword here -- it passes a reference to a reference. The reason he needs the ByRef keyword is because the subroutine modifies the contents of the parameter. If he had used ByVal instead, the subroutine would have no effect in the scope of the calling procedure.

    Of course, it would have been better style to use a ByVal parameter and to return the results from a function instead of a subroutine. Methods that modify their parameters in the calling scope are unexpected and should thus be avoided when possible.

  • skington (cs) in reply to brian
    brian:
    Anyone who doubts regular expressions or PERL is advised to go here: http://99-bottles-of-beer.net/language-perl-737.html

    and run that code.

    Or Abigail's prime number identifier:

    sub is_prime { my ($number) = @_; return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \Z/xms; }

    Although this is no longer a regular expression ;-).
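
A simplified Python rendering of the same trick, for anyone who wants to try it. It drops the atomic group `(?> ... )` from the Perl version, which changes how the engine backtracks but not which strings match. The backreference `\1` is what takes the pattern beyond regular languages in the formal sense.

```python
import re

# Matches a string of n ones when n is 0, 1, or composite:
#   1?         covers n = 0 and n = 1
#   (11+?)\1+  covers n = a * b with a >= 2 and b >= 2
NOT_PRIME = re.compile(r"\A(?:1?|(11+?)\1+)\Z")


def is_prime(number: int) -> bool:
    """Primality test on the unary representation of `number`."""
    return NOT_PRIME.match("1" * number) is None


print([n for n in range(20) if is_prime(n)])
# prints: [2, 3, 5, 7, 11, 13, 17, 19]
```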

  • Stof (unregistered) in reply to skington
    skington:
    Nope - it will stop at the space (which isn't part of the [A-Za-z0-9] match), turning [image] into [image]. Assuming that the replace part of your regex lower-cases the stuff that was matched.
    Nope, it'll also corrupt the text inside the TITLE attribute. It'll lowercase the B.
    skington:
    With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.
    I doubt you can do it. For the same reasons you shouldn't use regexps to parse mathematical expressions. You can make a shallow parse of an XML document though. And you'll have to code something to process the result after that of course.
  • savar (cs)

    The real WTF (besides having to scroll to the bottom to reply to the first post) is this:

    He wanted to learn more about regular expressions and, realizing that all HTML tools are probably written with regular expressions, decided to dig into one.
    No, most HTML parsers are written with state machines. Regexes are totally inappropriate for most HTML operations.

    In the given example Regex is okay because you're really doing a blind find-and-replace operation without worrying about the validity of the HTML.

    Still, I would wager money that the "fix" to this code involves some WTFs that would be totally abused by malformed HTML.

  • snoofle (cs) in reply to JAPH
    JAPH:
    anon:
    Stof:
    The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    don't be ridiculous, scrabble doesn't have punctuation

    Sure it does - that's what the blank tiles are for (aren't they?)

  • anon (unregistered) in reply to Dan
    Dan:
    I think this is by far the biggest WTF out of this whole topic.

    Actually I think the bigger WTF is that you (and others) are taking such an obvious troll seriously.

    (come on - Perl, "line noise", scrabble tiles, I couldn't have been much more obvious)

    It never ceases to amuse me how most of the commenters on this site can't spot deadpan sarcasm, even though it's the "house tone" of the entire site

  • DSTMan (unregistered)
    anon:
    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    You shouldn't have gone anonymous so that we could properly attribute this most wonderful true sentence to you. Thumbs way up! :winky:

    captcha: xevious - eh. I dislike SHMUPS.
