Admin
i like that he replaces html with head and body.
Admin
I also love how it catches an exception, only to throw it again. Code like this always makes me feel better about my code.
Admin
First!
Omfg, that's not even funny... How can someone who has obviously never heard of toLowerCase() (or its equivalents) write an HTML parser... I'd love to see what that thing does when I want to see its DOM tree... it does have a DOM tree, right?.... I feel a headache coming on... sigh
Admin
wow. Maybe I'm just lazy, but after thirty lines of this, you just have to be tempted to find a better way.
Admin
The real WTF is that he is using regexps to manipulate an xml/html document. Regexps are inappropriate for xml parsing.
Admin
Does someone want to explain why HTML goes to body? And why every other version of html with an uppercase letter, or two, or three, goes to something else? I can actually understand the rest of it, if some intern started by just lowercasing everything and someone said "remember just to lowercase the tags". Sometimes people don't think when faced with a situation like that. I also wonder if they remembered every tag and if they typed it all manually.
Captcha: stinky
Admin
I'm betting that if he's expecting <hTml> as a tag, there's no way in hell that the HTML is valid XML syntax. Thus an XML parser is just going to barf all over.
in this scenario, one simple RegEx should do the trick
Admin
Regexps are fine for changing the case of tag names; it's not that hard to write a proper regexp to isolate elements of a well-formed HTML document.
Admin
Regexps are inappropriate for human exposure, period.
Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.
("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)
Admin
I love it when programmers can't handle concepts like bubbling and logical separation. Working on their code after the fact always proves painful but financially rewarding.
Of course in this situation he's still eating the stack trace by explicitly re-throwing the exception... ah well. It wouldn't be a WTF without a few extras here and there. :P
Admin
Interesting, he changes html to body :Þ
Admin
don't be ridiculous, scrabble doesn't have punctuation
Admin
You are kidding, right? I know they can become unwieldy at times (the famous multi-line email expression springs to mind), but if you spend a small amount of time with regular expressions, they quickly become much more than mere 'line noise'.
Oh, and I'm a C# developer, and I've not touched Perl in years!
Admin
Fail.
Admin
I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?
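On the question above: a plain pattern can't change case by itself, but most regex libraries accept a callback that computes the replacement for each match. A minimal sketch in Python (not the language of the original code); the name `lowercase_tag_names` and the sample input are made up for illustration, and it assumes tag names are plain alphanumerics:

```python
import re

# "<" or "</" followed by an alphanumeric tag name.
# Group 1 is the bracket part, group 2 is the tag name.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tag_names(html: str) -> str:
    """Lower-case tag names only; attributes and text are left alone."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), html)


print(lowercase_tag_names('<HtMl><BODY Class="Main">Hi</BODY></HtMl>'))
# prints: <html><body Class="Main">Hi</body></html>
```

It is still a blind find-and-replace: a "<" followed by letters inside an attribute value or a script block would get lower-cased too.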
Admin
For perl all tiles should have punctuation (at least one item) printed on each blank face, then follow the above procedure...
Mind you - I can't talk, I've just been scripting in bash as well... yuk
Line noise expressions can be fun if well commented - pity that by the time you've commented it the darned input has changed so it all crashes...
Admin
You'd have to worry about attributes, though, assuming you want to leave attribute value case alone.
Admin
I think this is by far the biggest WTF out of this whole topic. Once I learned regexps, I can't imagine the clunky "WTF" type code I'd be writing if I didn't have them.
Admin
Won't work by default. Regexps are greedy, and such a pattern will match all the text between the first < and the last >
Admin
I also think a few people are missing that this isn't the HTML parser. This is just a hack preprocessor for what we can only assume is an awful HTML parser.
Regexs are difficult to use correctly for parsing HTML because of the greedy nature of most regexs. It's not impossible .. but hardly the easiest way.
Admin
That's why he said "reluctant". <.+?> should do the trick in most cases, but that's not a standard.
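The greedy versus reluctant difference is easy to see side by side. A small sketch in Python, where `+?` is the reluctant quantifier; the sample string is made up:

```python
import re

sample = "<B>bold</B> and <I>italic</I>"

# Greedy: ".+" grabs as much as it can, so one match runs from
# the first "<" to the last ">".
greedy = re.findall(r"<.+>", sample)

# Reluctant: ".+?" stops at the first ">" it can, giving one match per tag.
reluctant = re.findall(r"<.+?>", sample)

print(greedy)     # ['<B>bold</B> and <I>italic</I>']
print(reluctant)  # ['<B>', '</B>', '<I>', '</I>']
```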
Admin
What about :
<tag value="value with a > just to mess with regexp parsing mechanisms">Face it, regexps were never meant to parse xml :)
Admin
The regexp "parsing mechanisms" are only as restrictive as the programmer makes them...
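To make that concrete with the tag quoted above: the reluctant pattern stops at the ">" inside the attribute value, while a pattern that treats quoted strings as units gets the whole tag. A sketch in Python; `QUOTE_AWARE` is a name invented here, and it only handles a single tag with properly closed quotes, nothing like real-world HTML:

```python
import re

tag = '<tag value="value with a > just to mess with regexp parsing mechanisms">'

# Naive reluctant match: stops at the first ">", which sits inside the quotes.
naive = re.match(r"<.+?>", tag).group(0)

# Quote-aware: inside the tag, consume either an ordinary character or a
# complete quoted string, so a ">" between quotes does not end the tag.
QUOTE_AWARE = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")
aware = QUOTE_AWARE.match(tag).group(0)

print(naive)         # <tag value="value with a >
print(aware == tag)  # True
```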
Admin
Many VB.NET programmers don't seem to know this, but when you use "throw ex", it sets the exception's stack trace to that line. So if you are rethrowing an exception that you caught, "throw ex" will overwrite the exception's original location, making it much harder to determine where the real problem is. Instead, you should either wrap the exception in a new exception and throw that, or just use the "throw" keyword on its own, which rethrows the exception without resetting its stack trace.
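For the shape of the two recommended options, here is a rough analogue in Python rather than VB.NET. Python's traceback handling is not the same as .NET's, so this only illustrates "rethrow unchanged" versus "wrap and throw". `TagConversionError`, `parse`, and both wrapper functions are hypothetical names:

```python
class TagConversionError(Exception):
    """Hypothetical wrapper exception, named for illustration only."""


def parse(text):
    # Stand-in for the code that actually fails.
    raise ValueError(f"cannot parse {text!r}")


def rethrow_unchanged(text):
    try:
        parse(text)
    except ValueError:
        # Logging or cleanup would go here.
        raise  # bare raise: the same exception keeps propagating


def rethrow_wrapped(text):
    try:
        parse(text)
    except ValueError as ex:
        # Wrap the original; it stays reachable as __cause__.
        raise TagConversionError("tag conversion failed") from ex


try:
    rethrow_wrapped("<hTml>")
except TagConversionError as err:
    original = err.__cause__
    print(type(original).__name__, original)
```

If the catch block has nothing to add (no logging, cleanup, or extra context), leaving it out entirely gives the caller the same information.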
Admin
Given that this is a pre-processor of some snippet (or sub-set) of HTML that is required because whatever comes after it can't handle upper- or mixed-case tags, I think there are more important things to do than make sure the lower-casing regex doesn't choke on tags with unusual attributes ;-).
Anyway, in this case, your best bet is to match < [A-Za-z0-9]+) repeatedly and not bother about anything else. If the point of the transformation is to lower-case tags (and not attributes), then that's all you'll need. Unless you can have anything other than alphanumerics in a tag - I forget.
Admin
The real WTF is attempting to "preprocess" a flat text document. This code should not exist.
captcha: quake
Admin
I noticed the poster didn't include the "fix" regular expression. I can only suppose this is because it would lead to 4000 comments detailing all the cases it would miss.
This is one of those problems which looks quite easy to deal with at first, but then turns out to be quite difficult when you dig into it. HTML is probably the loosest offshoot of the SGML languages, and parsing it to handle all the cases correctly is far from trivial.
Admin
My favorite thing is that he passes in the string by reference (which is a good thing to do because it's probably a pretty long string, and no point in making an extra copy of it), and then tosses that tiny bit of efficiency away with all the "s = s.Replace(something, somethingelse)" lines.
Admin
< [A-Za-z0-9]+) I don't get it, what's the parenthesis for... and why the leading space? How about
<.[^ ]*
i.e., if you find an open bracket, ignore the first character and match all the way till you find a space... that's what I'm trying to do, don't know if this regex is what I just said... (not a regex pro)
Admin
In fact, it'll only corrupt the case there. Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.
Try to write a regexp to match the content of a tag inside a
Admin
Meh, was meant to be < ([A-Za-z0-9]+)
Your regex won't match e.g.
- no space.
Admin
Oh good sweet god. Love how it replaces the tags.
And.. I don't get the deal with the Exceptions. It's bad to throw an exception in a try/catch block? But it's even worse to go: throw new Exception? So... what the hell do you do with the exception? I mean, isn't the proper error handling to throw it, so the calling method (in the presentation layer) can also implement try/catch and then display the error? You can't display the error in the business logic or data logic layers, so what more can you do except throw the exception?
Admin
But I like regular expressions. I'll use them where I can but here is obviously not the place for them.
Admin
Nope - it will stop at the space (which isn't part of the [A-Za-z0-9] match), turning [image] into [image]. Assuming that the replace part of your regex lower-cases the stuff that was matched.
With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.
Admin
AFAICS, the best thing is that this case-conversion function catches "ex", and then throws "Ex"..
God, I am lucky that I work with the programming languages of my own choice! :-)
Admin
Ahem, I haven't seen anybody mention this so far, but IMHO the bigger WTF is thinking you can parse HTML with regular expressions.
That's basically impossible.
You see, regular expressions are cryptic, but REGULAR.
And HTML isn't.
For example, you can embed almost anything into an HTML document: JavaScript, PHP, and worse. Each one of those languages has its own syntax. Worse yet, that code tends to have quite a few HTML fragments in quotes of various kinds.
To correctly parse HTML you need a lexical scanner that goes through the HTML from top to bottom, interpreting every token it sees, and doing the right thing depending on context.
Regular expressions just don't give that kind of flexibility. At least not as they're commonly used.
I suppose you could scan the input one character at a time with a regular expression, but that would not be in any way fast, or clean, or probably reliable.
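For comparison, this is roughly what the top-to-bottom scanner approach looks like when someone else has already written the scanner. The sketch uses `HTMLParser` from Python's standard library, which is a swapped-in tool and not anything from the original code; `TagCollector` and the sample input are made up:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records each start tag and its attributes as the parser walks the input."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lower-cased.
        self.start_tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('<TaG Value="a > b">text</TaG>')
collector.close()

print(collector.start_tags)
```

The mixed-case tag name comes back lower-cased and the ">" inside the quoted attribute value does not end the tag, because the parser tracks context as it scans.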
Admin
Anyone who doubts regular expressions or PERL is advised to go here: http://99-bottles-of-beer.net/language-perl-737.html
and run that code.
Admin
This gave me an instant headache.
Captcha: doom (very appropriate)
Admin
Well, content.lower() is much easier, I think. It's time to switch to a nice language like python or learn the power of reg. expressions. Maybe both.
Admin
haha i like how at the end it goes from plural beers to singular beer
Admin
Of course, it would have been better style to use a ByVal parameter and to return the results from a function instead of a subroutine. Methods that modify their parameters in the calling scope are unexpected and should thus be avoided when possible.
Admin
Or Abigail's prime number identifier:
sub is_prime { my ($number) = @_; return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \Z/xms; }
Although this is no longer a regular expression ;-).
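The same trick ports to Python's `re` module. This sketch drops the atomic group `(?> ... )` from the Perl version, which should only affect backtracking speed, not the result. The number is written in unary as a string of 1s; the pattern matches when that string is empty, a single 1, or some block of two or more 1s repeated at least twice, i.e. when the number is not prime. The backreference `\1` is what takes it beyond a regular expression in the formal sense.

```python
import re

# Matches unary strings whose length is 0, 1, or composite.
NOT_PRIME = re.compile(r"1?|(11+?)\1+")


def is_prime(number: int) -> bool:
    """True if the unary form of `number` does NOT match the pattern."""
    return NOT_PRIME.fullmatch("1" * number) is None


print([n for n in range(20) if is_prime(n)])
# prints: [2, 3, 5, 7, 11, 13, 17, 19]
```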
Admin
The real WTF (besides having to scroll to the bottom to reply to the first post) is this:
No, most HTML parsers are written with state machines. Regexes are totally inappropriate for most HTML operations. In the given example a regex is okay because you're really doing a blind find-and-replace operation without worrying about the validity of the HTML.
Still, I would wager money that the "fix" to this code involves some WTFs that would be totally abused by malformed HTML.
Admin
Actually I think the bigger WTF is that you (and others) are taking such an obvious troll seriously.
(come on - Perl, "line noise", scrabble tiles, I couldn't have been much more obvious)
It never ceases to amuse me how most of the commenters on this site can't spot deadpan sarcasm, even though it's the "house tone" of the entire site
Admin
You shouldn't have gone anonymous so that we could properly attribute this most wonderful true sentence to you. Thumbs way up! :winky:
captcha: xevious - eh. I dislike SHMUPS.