Comment On Just in Case

The best thing about HTML is that (historically) consumers are insanely permissive about what they accept. The most obvious thing is that --- casting off the iron fist of XML --- hypertext markup allows tags to be case-insensitive. It is widely accepted that this helps designers more fully express themselves on the web.

Re: Just in Case

2007-04-03 09:03 • by chris (unregistered)
i like that he replaces html with head and body.

Re: Just in Case

2007-04-03 09:04 • by SomeoneElse (unregistered)
I also love how it catches an exception, only to throw it again. Code like this always makes me feel better about my code.

Re: Just in Case

2007-04-03 09:04 • by Cyberwizzard (unregistered)
First!

Omfg, that's not even funny... How can someone who has obviously never heard of toLowerCase() (or its equivalents) write an HTML parser... I'd love to see what that thing does when I want to see its DOM tree... it does have a DOM tree, right?... I feel a headache coming on... *sigh*

Re: Just in Case

2007-04-03 09:05 • by darren (unregistered)
wow. Maybe I'm just lazy, but after thirty lines of this, you just have to be tempted to find a better way.

Re: Just in Case

2007-04-03 09:06 • by Stof (unregistered)
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.

Re: Just in Case

2007-04-03 09:12 • by IV (unregistered)
Does someone want to explain why HTML goes to body, and why every other version of html with an uppercase letter, or two, or three, goes to something else? I can actually understand how the rest of it came about, if some intern started by just lowercasing everything and someone said "remember to lowercase just the tags". Sometimes people don't think when faced with a situation like that. I also wonder if they remembered every tag and if they typed it all manually.

Captcha: stinky

Re: Just in Case

2007-04-03 09:13 • by Jimmie (unregistered)
I'm betting that if he's expecting <hTml> as a tag, there's no way in hell that the HTML is valid XML syntax. Thus an XML parser is just going to barf all over.

in *this* scenario, one simple RegEx should do the trick

Re: Just in Case

2007-04-03 09:13 • by teqman (unregistered)
129932 in reply to 129929
Regexps are fine for changing the case of tag names; it's not that hard to write a proper regexp to isolate elements of a well-formed HTML document.

Re: Just in Case

2007-04-03 09:13 • by anon (unregistered)
129933 in reply to 129929
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

Re: Just in Case

2007-04-03 09:17 • by Tarwn (unregistered)
129935 in reply to 129924
I also love how it catches an exception, only to throw it again. Code like this always makes me feel better about my code.


I love it when programmers can't handle concepts like bubbling and logical separation. Working on their code after the fact always proves painful but financially rewarding.


Of course in this situation he's still eating the stack trace by explicitly re-throwing the exception... ah well. It wouldn't be a WTF without a few extras here and there.
:P

Re: Just in Case

2007-04-03 09:18 • by God (unregistered)
Interesting, he changes html to body :Þ

Re: Just in Case

2007-04-03 09:19 • by JAPH (unregistered)
anon:
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


don't be ridiculous, scrabble doesn't have punctuation

Re: Just in Case

2007-04-03 09:21 • by IceFreak2000
anon:

Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


You are kidding, right? I know they can become unwieldy at times (the famous multi-line email expression springs to mind), but if you spend a small amount of time with regular expressions, they quickly become much more than mere 'line noise'.

Oh, and I'm a C# developer, and I've not touched Perl in years!

Re: Just in Case

2007-04-03 09:22 • by RON (unregistered)
129939 in reply to 129933
anon:
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)



Fail.

Re: Just in Case

2007-04-03 09:22 • by Bob Janova
I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?

Re: Just in Case

2007-04-03 09:25 • by Bob
129942 in reply to 129937
JAPH:
anon:
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


don't be ridiculous, scrabble doesn't have punctuation


For perl all tiles should have punctuation (at least one item) printed on each blank face, then follow the above procedure...

Mind you - I can't talk, I've just been scripting in bash as well... yuk

Line noise expressions can be fun if well commented - pity that by the time you've commented it the darned input has changed so it all crashes...

Re: Just in Case

2007-04-03 09:29 • by Welbog
129943 in reply to 129941
Bob Janova:
I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?
The idea is you can easily match elements using a simple regular expression and then call toLowerCase() on what is matched. Just a reluctant "<.+>" would do the trick to match any element.

You'd have to worry about attributes, though, assuming you want to leave attribute value case alone.
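A minimal sketch of the match-then-lowercase idea, written in Python rather than the language of the article; the function name and the sample markup are made up for illustration. It also shows the attribute caveat: lower-casing the whole match changes attribute values too.

```python
import re

def lowercase_whole_tags(html: str) -> str:
    # "<.+?>" is the reluctant (lazy) form: it stops at the first ">" it can.
    # The callback lower-cases everything that was matched, attributes included.
    return re.sub(r"<.+?>", lambda m: m.group(0).lower(), html)

sample = '<P Class="Intro">Hello <B>World</B></P>'
print(lowercase_whole_tags(sample))
# <p class="intro">Hello <b>World</b></p>
```

The text between the tags is left alone, but "Intro" became "intro", which is exactly the attribute problem mentioned above.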

Re: Just in Case

2007-04-03 09:31 • by Dan (unregistered)
anon:
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


I think this is by far the biggest WTF out of this whole topic. Once I learned regexps, I can't imagine the clunky "WTF" type code I'd be writing if I didn't have them.

Re: Just in Case

2007-04-03 09:33 • by Stof (unregistered)
129945 in reply to 129943
Won't work by default. Regexps are greedy, and such a pattern will match all the text between the first < and the last >.

Re: Just in Case

2007-04-03 09:37 • by Welbog
129947 in reply to 129945
Stof:
Won't work by default. Regexps are greedy, and such a pattern will match all the text between the first < and the last >.
I take it you missed it when I said reluctant.

Re: Just in Case

2007-04-03 09:39 • by Chris D (unregistered)
I also think a few people are missing that this isn't the HTML parser. This is just a hack preprocessor for what we can only assume is an awful HTML parser.

Regexs are difficult to use correctly for parsing HTML because of the greedy nature of most regexs. It's not impossible .. but hardly the easiest way.

Re: Just in Case

2007-04-03 09:41 • by line (unregistered)
That's why he said "reluctant".
<.+?> should do the trick in most cases, but that's not a standard.
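For anyone who hasn't met the term, here is a small sketch of greedy versus reluctant matching, using Python's re module as one example of an engine that supports the `+?` syntax (as noted, not every regex flavour does). The sample string is invented.

```python
import re

text = "<B>bold</B> and <I>italic</I>"

# Greedy: ".+" grabs as much as it can, so one match spans first "<" to last ">".
greedy = re.findall(r"<.+>", text)

# Reluctant: ".+?" grabs as little as it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<B>bold</B> and <I>italic</I>']
print(lazy)    # ['<B>', '</B>', '<I>', '</I>']
```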

Re: Just in Case

2007-04-03 09:41 • by Joseph Newton (unregistered)
129950 in reply to 129933
anon:

Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

Well, no. There are regular expressions, and there are unreadable regular expressions. Like any programming construct, a regexp can be misused. The ones that try to do everything in one line tend to be pretty ugly--and incredibly inefficient. A well-constructed regexp, one that handles a single issue, can be very efficient and not that hard to follow.

anon:

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

Again, there are Perl hacks, and there are those who use Perl as a tool in well-structured programming. The language does provide structures that support structured programming, but gives its users all the ammunition they need for shooting themselves in the foot.

Re: Just in Case

2007-04-03 09:42 • by Stof (unregistered)
129951 in reply to 129947
Welbog:
I take it you missed it when I said reluctant.
Not really, it's just that I didn't know that term.

What about :

<tag value="value with a > just to mess with regexp parsing mechanisms">

Face it, regexps were never meant to parse xml :)

Re: Just in Case

2007-04-03 09:49 • by Welbog
129953 in reply to 129951
Stof:
Welbog:
I take it you missed it when I said reluctant.
Not really, it's just that I didn't know that term.

What about :

<tag value="value with a > just to mess with regexp parsing mechanisms">

Face it, regexps were never meant to parse xml :)
You're right about that one, yeah. Like I said, special consideration is needed for the attributes. I would tend to agree with you that those special considerations shouldn't be handled with regular expressions.
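A sketch of that failure case in Python, using the exact tag from the post. The second pattern is only an illustration of how far a quote-aware regex can be pushed (it is an assumption of this note, not anything from the article), and it already reads a lot worse than the naive one.

```python
import re

tag = '<tag value="value with a > just to mess with regexp parsing mechanisms">'

# The reluctant pattern stops at the first ">", which here sits inside the attribute value.
naive = re.search(r"<.+?>", tag).group(0)
print(naive)  # <tag value="value with a >

# A quote-aware pattern: inside the tag, consume either a quoted string whole
# or any single character that is not ">" or a quote.
QUOTE_AWARE = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^>"'])+>""")
aware = QUOTE_AWARE.search(tag).group(0)
print(aware == tag)  # True
```

The quote-aware version copes with this one tag, but it still knows nothing about comments, scripts, or unbalanced quotes.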

Re: Just in Case

2007-04-03 09:50 • by Dan (unregistered)
129954 in reply to 129951
Stof:
Welbog:
I take it you missed it when I said reluctant.
Not really, it's just that I didn't know that term.

What about :

<tag value="value with a > just to mess with regexp parsing mechanisms">

Face it, regexps were never meant to parse xml :)


The regexp "parsing mechanisms" are only as restrictive as the programmer makes them...

Re: Just in Case

2007-04-03 09:52 • by JL (unregistered)
129955 in reply to 129935
Tarwn:
Of course in this situation he's still eating the stack trace by explicitly re-throwing the exception... ah well. It wouldn't be a WTF without a few extras here and there.
:P

Agreed. I've inherited a bunch of code where the original coder wrapped *every function* with an exception handler containing only "throw ex". Easy enough to fix, but so pointless. So, a PSA:

Many VB.NET programmers don't seem to know this, but when you use "throw ex", it sets the exception's stack trace to that line. So if you are rethrowing an exception that you caught, "throw ex" will overwrite the exception's original location, making it much harder to determine where the real problem is. Instead, you should either wrap the exception in a new exception and throw that, or just use the "throw" keyword on its own, which rethrows the exception without resetting its stack trace.
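The two recommended options (rethrow unchanged, or wrap in a new exception) have rough counterparts in other languages. The sketch below is Python, not VB.NET, so it says nothing about how .NET handles stack traces; `PreprocessError`, `parse`, and the other names are hypothetical and exist only for this example.

```python
class PreprocessError(Exception):
    """Hypothetical wrapper exception, named for illustration only."""

def parse(text):
    # Stand-in for whatever really fails deep inside the call chain.
    raise ValueError("bad markup")

def rethrow_unchanged(text):
    try:
        parse(text)
    except ValueError:
        raise  # bare raise: the same exception continues upward, traceback intact

def rethrow_wrapped(text):
    try:
        parse(text)
    except ValueError as exc:
        # Wrap in a new exception and keep the original attached as __cause__.
        raise PreprocessError("preprocessing failed") from exc

try:
    rethrow_wrapped("<hTml>")
except PreprocessError as err:
    print(type(err.__cause__).__name__)  # ValueError
```

Either way the caller can still find out where the failure started, which is the whole point of the PSA.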

Re: Just in Case

2007-04-03 09:53 • by skington
129956 in reply to 129951
Given that this is a pre-processor of some snippet (or sub-set) of HTML that is required because whatever comes after it can't handle upper- or mixed-case tags, I think there are more important things to do than make sure the lower-casing regex doesn't choke on tags with unusual attributes ;-).

Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.

Re: Just in Case

2007-04-03 09:55 • by Marcin (unregistered)
The real WTF is attempting to "preprocess" a flat text document. This code should not exist.

captcha: quake

Re: Just in Case

2007-04-03 09:55 • by Hit (unregistered)
I noticed the poster didn't include the "fix" regular expression. I can only suppose this is because it would lead to 4000 comments detailing all the cases it would miss.

This is one of those problems which looks quite easy to deal with at first, but then turns out to be quite difficult when you dig into it. HTML is probably the loosest offshoot of the SGML languages, and parsing it to handle all the cases correctly is far from trivial.

Re: Just in Case

2007-04-03 09:55 • by musigenesis
My favorite thing is that he passes in the string by reference (which is a good thing to do because it's probably a pretty long string, and no point in making an extra copy of it), and then tosses that tiny bit of efficiency away with all the "s = s.Replace(something, somethingelse)" lines.

Re: Just in Case

2007-04-03 10:02 • by Strider (unregistered)
129960 in reply to 129956
< [A-Za-z0-9]+)
i dont get it, whats the parenthesis for...and why the leading space?
how about

<.[^ ]*

i.e., if you find an open bracket, ignore the first character and match all the way till you find a space...thats what I'm trying to do, don't know if this regex is what I just said...(not a regex pro)

Re: Just in Case

2007-04-03 10:07 • by Stof (unregistered)
129961 in reply to 129956
skington:
Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.

I see what you mean. It would probably work and only "corrupt" slightly things like :
<img title="case for A<B">

In fact, it'll only corrupt the case there. Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

Try to write a regexp to match the content of a <a> tag inside a <div> tag for example.

Re: Just in Case

2007-04-03 10:09 • by skington
129962 in reply to 129960
Strider:
< [A-Za-z0-9]+)
i dont get it, whats the parenthesis for...and why the leading space?
how about

<.[^ ]*

i.e., if you find an open bracket, ignore the first character and match all the way till you find a space...thats what I'm trying to do, don't know if this regex is what I just said...(not a regex pro)


Meh, was meant to be
< ([A-Za-z0-9]+)
- i.e. capture the taggish-looking thing you matched, so you can call toLower or lc or whatever the function that lower-cases strings is in your language.
And the space because the emerging standard in Perl is to use extended-format regexes that ignore white space, to try and make them more readable.

Your regex won't match e.g. <br> - no space.
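A sketch of the capture-and-lowercase approach in Python. The character class is the one from the post; the optional "/" so that closing tags are handled too is an addition made for this example, and the function name and sample markup are invented.

```python
import re

# "<", an optional "/", then the tag name captured in a group.
TAG_NAME = re.compile(r"<(/?)([A-Za-z0-9]+)")

def lowercase_tag_names(html: str) -> str:
    # Rebuild each match with only the captured tag name lower-cased.
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + m.group(2).lower(), html)

print(lowercase_tag_names('<BR><IMG SRC="Logo.PNG"></BODY>'))
# <br><img SRC="Logo.PNG"></body>

# For comparison, the "<.[^ ]*" pattern applied to a tag with no space in it.
# In Python's re it does find something, but it runs past the ">" to the next space.
print(re.match(r"<.[^ ]*", "<BR>line one").group(0))
# <BR>line
```

Attribute names and values are left untouched here, since the match ends as soon as the tag name does.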

Re: Just in Case

2007-04-03 10:11 • by ObiWayneKenobi
Oh good sweet god. Love how it replaces the tags.

And.. I don't get the deal with the Exceptions. It's bad to throw an exception in a try/catch block? But it's even worse to go: throw new Exception? So... what the hell do you do with the exception? I mean, isn't the proper error handling to throw it, so the calling method (in the presentation layer) can also implement try/catch and then display the error? You can't display the error in the business logic or data logic layers, so what more can you do except throw the exception?

Re: Just in Case

2007-04-03 10:13 • by Welbog
129964 in reply to 129961
Stof:
skington:
Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.

I see what you mean. It would probably work and only "corrupt" slightly things like :
<img title="case for A<B">

In fact, it'll only corrupt the case there. Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

Try to write a regexp to match the content of a <a> tag inside a <div> tag for example.
Indeed. The best way to go about normalizing a document like this isn't to pick out elements individually, but rather to start at the top of the document and work through it token by token (if not character by character), keeping track of whether the current character is an element (i.e. encountered a < but not > yet), and whether the current character is an attribute value (i.e. encountered a =" but not " yet), and changing the case accordingly. This way only requires one loop through the string, and only requires in-place changes to be made to the string (so if you're using a char[] instead of an immutable string you don't have to remake the string for each replace).

But I like regular expressions. I'll use them where I can but here is obviously not the place for them.
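The single-pass idea can be sketched roughly as follows, in Python. This is only one reading of the description above: the function name is made up, it builds a list instead of editing a char[] in place, and it assumes quoted attribute values. It lower-cases attribute names as well as tag names, it would also lower-case unquoted attribute values, and it knows nothing about comments, scripts, or a stray "<" in body text.

```python
def lowercase_tags(html: str) -> str:
    out = []
    in_tag = False   # True between "<" and the matching ">"
    quote = None     # quote character of the attribute value we are inside, if any
    for ch in html:
        if not in_tag:
            # Body text: copy as-is, and note when a tag starts.
            if ch == "<":
                in_tag = True
            out.append(ch)
        elif quote is not None:
            # Inside a quoted attribute value: copy verbatim until the closing quote.
            if ch == quote:
                quote = None
            out.append(ch)
        elif ch in "\"'":
            quote = ch
            out.append(ch)
        elif ch == ">":
            in_tag = False
            out.append(ch)
        else:
            # Tag name, attribute name, "=", "/" or whitespace inside the tag.
            out.append(ch.lower())
    return "".join(out)

print(lowercase_tags('<IMG TITLE="CASE FOR A<B">Some Text</IMG>'))
# <img title="CASE FOR A<B">Some Text</img>
```

Two flags and one loop are enough to get past both of the awkward attribute examples from earlier in the thread.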

Re: Just in Case

2007-04-03 10:15 • by skington
129965 in reply to 129961
Stof:
skington:
Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.

I see what you mean. It would probably work and only "corrupt" slightly things like :
<img title="case for A<B">

In fact, it'll only corrupt the case there.


Nope - it will stop at the space (which isn't part of the [A-Za-z0-9] match), turning <IMG TITLE="CASE FOR A<B"> into <img TITLE="CASE FOR A<B">. Assuming that the replace part of your regex lower-cases the stuff that was matched.

Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

Try to write a regexp to match the content of a <a> tag inside a <div> tag for example.


With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.

Re: Just in Case

2007-04-03 10:16 • by Hans (unregistered)
AFAICS, the best thing is that this case-conversion function catches "ex", and then throws "Ex"..

God, I am lucky that I work with the programming languages of my own choice! :-)

Re: Just in Case

2007-04-03 10:16 • by grg (unregistered)
Ahem, I haven't seen anybody mention this so far, but IMHO the bigger WTF is thinking you can parse HTML with regular expressions.

That's basically impossible.

You see, regular expressions are cryptic, but REGULAR.

And HTML isn't.

For example, you can embed almost anything into an HTML document: JavaScript, PHP, and worse. Each one of those languages has its own syntax. Worse yet, that code tends to have quite a few HTML fragments in quotes of various kinds.

To correctly parse HTML you need a lexical scanner that goes through the HTML from top to bottom, interpreting every token it sees, and doing the right thing depending on context.

Regular expressions just don't give that kind of flexibility. At least not as they're commonly used.

I suppose you could scan the input one character at a time with a regular expression, but that would not be in any way fast, or clean, or probably reliable.



Re: Just in Case

2007-04-03 10:19 • by brian (unregistered)
Anyone who doubts regular expressions or PERL is advised to go here:
http://99-bottles-of-beer.net/language-perl-737.html

and run that code.

Re: Just in Case

2007-04-03 10:22 • by JR (unregistered)
This gave me an instant headache.

Captcha: doom (very appropriate)

Re: Just in Case

2007-04-03 10:22 • by YourCoke (unregistered)
Well, content.lower() is much easier, I think.
It's time to switch to a nice language like python or learn the power of reg. expressions.
Maybe both.

Re: Just in Case

2007-04-03 10:23 • by joe (unregistered)
129973 in reply to 129969
brian:
Anyone who doubts regular expressions or PERL is advised to go here:
http://99-bottles-of-beer.net/language-perl-737.html

and run that code.


haha i like how at the end it goes from plural beers to singular beer

Re: Just in Case

2007-04-03 10:24 • by JL (unregistered)
129974 in reply to 129959
musigenesis:
My favorite thing is that he passes in the string by reference (which is a good thing to do because it's probably a pretty long string, and no point in making an extra copy of it), and then tosses that tiny bit of efficiency away with all the "s = s.Replace(something, somethingelse)" lines.

Actually, String is a reference type in .NET, so it's actually marginally less efficient to use the ByRef keyword here -- it passes a reference to a reference. The reason he needs the ByRef keyword is because the subroutine modifies the contents of the parameter. If he had used ByVal instead, the subroutine would have no effect in the scope of the calling procedure.

Of course, it would have been better style to use a ByVal parameter and to return the results from a function instead of a subroutine. Methods that modify their parameters in the calling scope are unexpected and should thus be avoided when possible.

Re: Just in Case

2007-04-03 10:25 • by skington
129975 in reply to 129969
brian:
Anyone who doubts regular expressions or PERL is advised to go here:
http://99-bottles-of-beer.net/language-perl-737.html

and run that code.


Or Abigail's prime number identifier:

sub is_prime {
    my ($number) = @_;
    # Unary trick: the pattern matches 0, 1, and composite lengths, so a non-match means prime.
    return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \Z/xms;
}

Although this is no longer a regular expression ;-).
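For readers without Perl to hand, a rough Python port of the same trick. It is a sketch, not a faithful translation: the atomic group `(?> ... )` is dropped because older Python versions do not support it, which should only affect how much backtracking happens. The backreference `\1` is what takes the pattern outside the formally regular languages.

```python
import re

# Matches a string of 1s whose length is 0, 1, or composite:
# "1?" covers 0 and 1; "(11+?)\1+" is a block of two or more 1s repeated at least twice.
NOT_PRIME = re.compile(r"1?|(11+?)\1+")

def is_prime(number: int) -> bool:
    # Write the number in unary and see whether the whole string fails to match.
    return NOT_PRIME.fullmatch("1" * number) is None

print([n for n in range(20) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
```

It is a party trick rather than a sensible primality test, since the work grows quickly with the size of the number.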

Re: Just in Case

2007-04-03 10:28 • by Stof (unregistered)
129976 in reply to 129965
skington:
Nope - it will stop at the space (which isn't part of the [A-Za-z0-9] match), turning <IMG TITLE="CASE FOR A<B"> into <img TITLE="CASE FOR A<B">. Assuming that the replace part of your regex lower-cases the stuff that was matched.

Nope, it'll also corrupt the text inside the TITLE attribute. It'll lowercase the B.

skington:
With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.

I doubt you can do it. For the same reasons you shouldn't use regexps to parse mathematical expressions. You can make a shallow parse of an XML document though. And you'll have to code something to process the result after that of course.
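The disagreement about the IMG example is easy to check. The sketch below applies the tag-name pattern from the earlier post in Python (the function name is made up); it suggests both posters have a point: the match does stop at the space, but the "<B" inside the attribute value is treated as another tag.

```python
import re

def lowercase_tag_names(html: str) -> str:
    # Lower-case "<" followed by a run of alphanumerics, wherever it occurs.
    return re.sub(r"<([A-Za-z0-9]+)", lambda m: m.group(0).lower(), html)

print(lowercase_tag_names('<IMG TITLE="CASE FOR A<B">'))
# <img TITLE="CASE FOR A<b">
```

TITLE and the rest of the value survive, but the B after the "<" does not, because the pattern has no idea it is inside a quoted attribute.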

Re: Just in Case

2007-04-03 10:32 • by savar
The real WTF (besides having to scroll to the bottom to reply to the first post) is this:
He wanted to learn more about regular expressions and, realizing that all HTML tools are probably written with regular expressions, decided to dig into one.

No, most HTML parsers are written with state machines. Regexes are totally inappropriate for most HTML operations.

In the given example a regex is okay because you're really doing a blind find-and-replace operation without worrying about the validity of the HTML.

Still, I would wager money that the "fix" to this code involves some WTFs that would be totally abused by malformed HTML.
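As one example of the non-regex route, Python's standard library ships an event-driven HTML parser that hands tag names to its callbacks already lower-cased. The sketch below is only an illustration of that style of tool; it is not the parser from the article, and the `TagLister` class and the sample markup are invented.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collects the names of the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name in lower case, whatever the input used.
        self.tags.append(tag)

parser = TagLister()
parser.feed('<hTml><BODY><IMG TITLE="CASE FOR A<B"></BODY></hTml>')
parser.close()
print(parser.tags)  # ['html', 'body', 'img']
```

With a parser like this in front, a case-fixing preprocessor would not be needed in the first place.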

Re: Just in Case

2007-04-03 10:33 • by snoofle
129978 in reply to 129937
JAPH:
anon:
Stof:
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.


Regexps are inappropriate for human exposure, period.

Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


don't be ridiculous, scrabble doesn't have punctuation

Sure it does - that's what the blank tiles are for (aren't they?)

Re: Just in Case

2007-04-03 10:35 • by anon (unregistered)
129979 in reply to 129944
Dan:

I think this is by far the biggest WTF out of this whole topic.


Actually I think the bigger WTF is that you (and others) are taking such an obvious troll seriously.

(come on - Perl, "line noise", scrabble tiles, I couldn't have been much more obvious)

It never ceases to amuse me how most of the commenters on this site can't spot deadpan sarcasm, even though it's the "house tone" of the entire site.

Re: Just in Case

2007-04-03 10:38 • by DSTMan (unregistered)
anon:

("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)


You shouldn't have gone anonymous so that we could properly attribute this most wonderful true sentence to you. Thumbs way up! :winky:

captcha: xevious - eh. I dislike SHMUPS.