Admin
I really despise these evangelical articles about regex. The original code did what it was supposed to do given the situation.
This site is becoming the daily Whoops No Regex. (The Daily WNR)
Admin
And there we have it. A WTF winnah.
Admin
Awww... it's ok. Other people didn't learn RegEx either, so you're not alone! :)
-- Seejay (who did learn RegEx at her WTF University... at least they did something right there!)
Admin
Tedshade? Is that you?
Admin
That's not a WTF - the poor guy probably just didn't know regular expressions. Hopefully he learned something new.
Admin
That's the closest one so far. Amazing how many people complain about Mickey's version and don't get it much better themselves.
It still won't handle
but the chances of that happening are slim (who uses titles for
tags? id, clear, style and class attributes won't normally have '>' characters in their values).
However
/<br([^'">]*"[^"]*"|[^'">]*'[^']*')*[^'">]*>/i
is closer AFAICS
Admin
And we can just hope that the regexp engine isn't running in greedy mode by default :)
How about:
Admin
You're not even allowed to have unescaped special characters within tags if memory serves? Of course... there's HTML and there's tagsoup.
Admin
This post is an example of why I love regexp threads so much. (Forum won't let me quote it right.)
Admin
Listen to me carefully: if for whatever reason I ever have to debug or read VB code (or any other) and find that kind of crap, I will be slapping someone to his/her knees with the 100 pages of pain of the Regular Expressions Pocket Reference book. Or even better, I would download it to a hard drive and go all postal with it: "there, learn regexps while you bleed!"
Brain Waffles with a hard drive.
Admin
What type of VB are we talking -
startBr = InStr(searchFrom, LCase(html), "<br")
endBr = InStr(startBr, html, "/>")
' "/>" is two characters long, so the text after the tag starts at endBr + 2
html = Left(html, startBr - 1) + vbCrLf + Mid(html, endBr + 2)
also what about
?
Admin
You guys are all the real WTF. Using regex to parse xml/html is just wrong. What if there's a <br> embedded in an attribute that needs to be preserved? The correct solution is to use a real parser.
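For what the "real parser" route might look like, here is a minimal sketch in Python (not the VB of the original article), using the standard library's `html.parser`. The class name `BrToNewline` and the helper `replace_br` are made up for illustration. The sketch rebuilds the document piece by piece, so it normalizes end-tag case and silently drops processing instructions; treat it as a starting point under those assumptions rather than a finished tool.

```python
from html.parser import HTMLParser


class BrToNewline(HTMLParser):
    """Rebuild the input, swapping <br> elements for newlines (hypothetical helper)."""

    def __init__(self):
        # Keep entity and character references as separate events so they
        # can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        else:
            # Raw source text of the tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> take the same path.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "br":
            self.parts.append(f"</{tag}>")  # tag name arrives lowercased

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        # Comments are copied through, so a <br> inside one survives.
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def replace_br(html):
    parser = BrToNewline()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


attr_result = replace_br('<a title="use <br> here">one<BR />two</a>')
comment_result = replace_br("a<!-- <br> -->b<br>c")
```

The `<br>` inside the `title` attribute and the one inside the comment are left alone, while the real tags become newlines.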
Admin
No, this replaces all tags which start with 'br', so also stuff like <break>, <bright>, <brillant>, etc. Better would be: /<br(\s[^>]*)?>/i which only matches if there is at least one whitespace, if there are any attributes or other junk in the tag.
Of course this still wouldn't work if you use prefixes, like html:br/. Good luck with solving that with regexes ;)
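A quick way to check what the pattern quoted above accepts is to run it over a handful of candidate tags. This sketch uses Python's `re` module; the sample list is invented for illustration.

```python
import re

# The pattern from the comment above, case-insensitive.
BR_TAG = re.compile(r"<br(\s[^>]*)?>", re.IGNORECASE)

samples = ["<br>", "<BR >", '<br clear="all">', "<br />", "<br/>", "<break>", "<brillant>"]

# fullmatch: the whole sample must be consumed by the pattern.
matched = [s for s in samples if BR_TAG.fullmatch(s)]
rejected = [s for s in samples if not BR_TAG.fullmatch(s)]
```

Here `<break>` and `<brillant>` are rejected as intended, but so is `<br/>` written without a space, because the optional group has to start with whitespace.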
Admin
I hope no one intends to use their own HTML tag, like <brisket>, because that would match most of the regex's that people have posted so far.
Or if Netscape comes out with the <brown> tag, you're all dead.
Admin
it's a chance for him that the tag's name isn't longer than 2 characters...
Admin
So what have we learned today? To reliably match BR-tags with a regex you need a regex that looks like line noise, that you don't really type or read like normal code but more like a sort of puzzle, that's very hard to check for correctness, and almost certainly misses a cornercase somewhere.
Admin
I guess that RegEx.Replace returns the result and doesn't edit the string in place (as it would with a byref parameter),
so his "solution" is still broken :p
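The same trap exists in most regex APIs. As a rough Python analogue (an assumption for illustration, not the VB code under discussion), `re.sub` also returns a new string and leaves its input untouched:

```python
import re

html = "one<br>two"

# The result is thrown away here, so html is unchanged.
re.sub(r"<br>", "\n", html)
after_discarded_call = html

# Assigning the return value is what actually applies the replacement.
html = re.sub(r"<br>", "\n", html)
```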
Admin
Regex would probably be the best choice in this scenario, but IMO it's far too widely used when there are better (and faster) choices. I blame Perl.
Admin
Fails to detect
.
Admin
I only have 2 comments to make about this.
I'd definitely be guilty of writing code like this. Yes, I learned some Regular Expressions, but I haven't used them in years, since not too many embedded systems use them. The question should be asked: What sort of code would you have written had you forgotten that regular expressions were an option?
Judging from some of the comments I'm reading here, I'm not convinced that everyone else here understands regular expressions either. They're a good way to get things done, but it's also damn easy to shoot yourself in the foot hard.
Admin
I would say that the real WTF is that this isn't even a good way to do it without regex.
a) DOM exists for a reason (not that it is the most performant solution).
b) If DOM is banned, why not parse it LToR, break out into a tagname check on encountering "<"?
The most interesting part is this:
c) If, for some mysterious reason (use of DOM, regex and LToR parsing all banned), this really is the way it has to be done, why 3 LCase() calls? Why not toLower the whole thing, store it, then do the three tests with that? I suppose that the extra two scans through the whole string must be pretty insignificant when faced with all the various
etc. tests.
This points to the coder being essentially an untrained scripter who doesn't understand that the machine (on the whole) has to do as it is told, and that there is an impact to adding extraneous commands.
However, if that is the case, Why bother with the three tests at all? They can only be there to "improve performance" - Why not just attempt the replaces anyway? Unless there is some significant chance that many of the input documents have no br tags at all, the performance improvement gained by the tests can't be that significant.
So, I expect that the tests went in, because, after slowness complaints by a punter, one of his superiors took a look at the code and said "that function looks pretty intensive, perhaps you should put a test in, so it doesn't always do all those replacements".
Don't blame the fresh-out-of nappies grunt coder for this WTF, blame his peter principle boss, who got "promoted" away from the code to somewhere where he couldn't do any harm.
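Option b) above, a single left-to-right pass that stops at each "<" and checks the tag name, could be sketched like this in Python. The function name `replace_br_scan` is hypothetical, and the sketch assumes no ">" appears inside attribute values or comments.

```python
def replace_br_scan(html, replacement="\n"):
    """One left-to-right pass; replaces tags whose name is exactly 'br'."""
    out = []
    i = 0
    n = len(html)
    while i < n:
        if html[i] != "<":
            out.append(html[i])
            i += 1
            continue
        end = html.find(">", i)
        if end == -1:
            # Unterminated tag: copy the rest through untouched.
            out.append(html[i:])
            break
        inner = html[i + 1:end]
        # The tag name is the leading run of letters and digits.
        name_len = 0
        while name_len < len(inner) and inner[name_len].isalnum():
            name_len += 1
        if inner[:name_len].lower() == "br":
            out.append(replacement)
        else:
            out.append(html[i:end + 1])
        i = end + 1
    return "".join(out)


scan_result = replace_br_scan("a<br>b<BR />c<break>d<br/>e")
```

The input is scanned once and never lowercased as a whole; only the short tag name is.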
Admin
Indeed. I had to code a regex engine, along with push down automaton engine and Turing machine engine for my comp sci degree. Besides, isn't the root of many, if not most, WTFs a developer who is ignorant of common tools?
Admin
Would that be a BReak dance? :-)
Admin
Good point Grasshoppa.
What if the text is "All are invited <Brunch is at noon>" or "Use <br> to show a line break"? In the end it'd depend on what the data being parsed looked like.
Admin
/<\sbr([^\w>][^>])?>/i
And with quoted > allowed:
/<\sbr(^\w>'")?>/i
Captcha: ewww... Exactly what this last regexp looks like.
Admin
I really find it hard to believe that a legitimate CS curriculum would not cover regular expressions at some point. Granted, the only language-specific class I encountered that covered regular expressions in any detail was an elective "intro to perl" class.
However, I had a required "theory of computation" class during my senior year that thoroughly covered regular languages and the equivalence of DFAs, NFAs and Regular Expressions. It also went far beyond that into the realm of context-free languages and Turing machines. After that, I had a "programming languages" class where we had to write an interpreter for a language the professor made up. The first step was to break the code up into tokens (the scanner/lexer). Without using regular expressions, or at least emulating a DFA in code (that's what I wound up doing for reasons that I can't remember), that would have been pretty damn hard. I would think that most, if not all, CS programs require their students to write a compiler/interpreter at some point. That would seem really hard to do if you didn't have an understanding of "regular languages" on some level.
If we're talking about a 2-year associates degree in "programming" or something, then I could see them not covering regular languages or regular expressions at all...If somebody got through 4-year university CS program without being exposed to them, then I think they got cheated.
Admin
CS/CE majors who haven't figured out regex's, various unix utilities, and other "tools of the trade" that aren't taught in class are a big enough WTF on their own. I definitely agree that it is unremarkable that a typical CS major would not know regex's, however I think that one who doesn't take the time to learn such things will become a quite unremarkable developer who will spend many years writing code drenched in WTFery. For the record, regex's were covered in several CS courses that I took.
On a slightly unrelated note, I found during my senior year that I knew a considerable number of people who were ready to graduate with CS degrees and pretty good GPA's who didn't know what pointers were, and in fact knew absolutely nothing about memory management. I think this is thanks to Java being the only practical language taught in many CS curriculums today. I think everyone should start with C, move to C++, and then once the fundamentals of programming and good practice have become ingrained, students should be allowed to use Java or other managed languages to simplify their lives.
</ RaNt>
Admin
Sure, but if you used a greedy regex it would eat it too.
Admin
CIS is a completely different degree than Computer Science. At most universities, it's more business-oriented than technically oriented.
My school's version of this was called "MIS" and I majored in it for two years before switching to CS. It was far too easy for my tastes and I didn't feel like I got much out of it, hence the switch.
Admin
<\s*br[^>]*>
Admin
At UC Irvine, we covered state machines, finite automata, and TOUCHED ON regular expressions in one of our MATH Classes... but didn't really have much coding coursework related to it. I am profoundly grateful for my part-time job at the time which introduced me to Perl and regular expressions way before my time. =)
And... while regular expressions might cook you eggs in the morning, it might look as if the chicken were drunk. (er ... guess I'm mixing my metaphors there. ;))
Admin
I'm not kidding. That was a very good course.
We also learnt HTML/XHTML, perl, PHP, 68000 assembler, and low-level computer architecture (transistors FTW), along with appropriate maths (Matrices, quaternions etc).
It was hard to find a good games programming course, most of the places I looked thought that Java or Flash was a good language to base a games programming course around. Those courses wouldn't have got me working with UE3 and an Xbox360 dev kit. Instead I would have been working in a group of maybe 5 people churning out "new" clones of old 2D games for mobile phones.
Admin
I don't suppose it was written by the same person that made all the #defines for cout cOut cOuT...
Admin
I thought I'd expand on this.
The material was "covered in some form" -- we just didn't have any required coursework that required us to DO much with it, except perhaps a homework assignment I might be forgetting. Our compilers class was optional, but even that I don't think had us using regular expressions. We did have to implement a token recognition framework, and my knowledge of regexes made understanding BNF/EBNF easier ... but in class it was never tied to the fact that there were language features out there already supporting it.
Thank god for Perl, or I'd have had to learn regexes the hard way. ;)
Admin
Regexes are an obvious product of the study of DFA's, grammars, Turing machines and the like, so it would be a real WTF to have a CS course without regexes.
In my CS program regexes were part of a mandatory 2nd year class (do basic RE's with ?, +, and * operators, implement an LL(1) and LR(1) parser, DFA theory, Turing machines, etc).
Compilers were an optional 4th year class (the language to be compiled was Ada IIRC).
I'm guessing that most of the technical colleges (the ones with trademarked product names in their course titles) barely teach CS theory at all, so it wouldn't surprise me to learn that they ignore regexps.
Admin
The real WTF is that to replace <br> in all the correct places, you need a fairly complete HTML parser; neither a string replace nor a regexp will do.
The
could be in a comment, some embedded code, or inside a or
Admin
Where did those goggles go ?
Admin
If you place your breakfast on your CPU, then use the right combination of minimal-matching modifiers on a long non-matching string, then yes, regular expressions can cook your breakfast for you.
For example, the regexp '(a+)*\d' with a string length of 1GB of 'B's raises my CPU temperature by 7 degrees and makes my CPU fan noisy enough to attract the attention of my co-workers in neighboring cubes (hi Mike!).
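A small, safe way to see the effect is to time a failed match on short inputs. This Python sketch assumes a run of 'a' characters rather than the 'B's mentioned above, since that is the input on which a backtracking engine has to try every way of splitting the run between the nested quantifiers before giving up. The helper name and the lengths are arbitrary; the actual timings depend on the machine and the regex engine.

```python
import re
import time

PATTERN = re.compile(r"(a+)*\d")


def time_failed_match(n):
    """Return (match result, seconds) for n 'a' characters with no digit."""
    text = "a" * n
    start = time.perf_counter()
    result = PATTERN.match(text)
    return result, time.perf_counter() - start


timings = [(n, time_failed_match(n)) for n in (12, 14, 16, 18)]
for n, (result, seconds) in timings:
    print(f"n={n:2d}  matched={result is not None}  {seconds:.4f}s")
```

On a backtracking engine the time roughly multiplies with every couple of extra characters, which is why a gigabyte of input is not needed to keep a CPU busy.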
Admin
Holy crap! You obviously did not go to community college like I did. I bet you had to learn a lot of math.
Admin
Most of the postings here are just trying to do more or less exactly what the original code did, but more concisely and with some really obvious flaws fixed.
Of course the real WTF is that he's not using a full HTML parser library, which would get the syntax and grammar right, but probably also have unpredictable runtimes, buffer overrun bugs, miscellaneous security issues, many error cases to handle, much larger memory and CPU overheads, maybe even outrageous licensing fees, and at the end of the day probably spend the rest of its service life parsing inputs that consist only of very similar text with the occasional <br> tag.
For all we know, this code might read from a corporate intranet's internal database web service, and the only possible tags it might ever see are exactly the set "
" and "
".
My cell phone's web browser saw a web page with a line very similar to that once.
It turned the phone into a brick.
To be fair, the phone's browser would crash (automatically resetting the phone to power-on state) on any XHTML tag, so there were fairly craaazy firmware bugs to start with. That particular construct seemed to hit the triple-word-score square on the Scrabble board of code bugginess, and wrote something to flash memory that caused the same crash during the power-on sequence.
One expensive firmware-reflashing (and upgrade) later, now the web browser pops up an error dialog ("out of memory") on that line of HTML, but doesn't crash the phone.
But I digress.
Shame! Shame on people for not writing full table-driven recursive parsers in their one-off forum postings. Shame!
Admin
At academic institutions, they focus on theories and knowledge.
At vocational institutions, they focus on practice and experience.
Most real-life institutions offer some mix of academic and vocational study.
Admin
Alas, some of us don't get a choice. My work requires me to use VBA (usually Access) because it's the only tool we can use in-house. Anything more complicated requires head office approval, a mountain of specs, and they wouldn't let us build it anyway.
Admin
Which character class matches annoying captcha posts??
Admin
Even without RegExps there is a slightly nicer solution that would have been as reliable as the original code and wouldn't have used any special concepts. Simply looking up the usage of the Replace function would have revealed the vbTextCompare flag. Storing the most likely variations in an array also removes a few lines from the code. I could be wrong in my reasoning, but it seems a waste to search the string 3 times for matches before searching again for the actual text replacement. It seems it might be more efficient to just call Replace.
The above code is not guaranteed to work, nor is it guaranteed to be free of WTFs. :P
Addendum (2007-10-31 16:54): Note: my code snippet is VBScript and the original appears to have been VB. :P There aren't too many differences, though there are some. It doesn't really change the solution much, however.
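The snippet this comment refers to did not survive, so what follows is not that code. It is a rough Python sketch of the same idea: keep the likely variants in a list and run one case-insensitive, regex-free replace per variant, with no pre-check. The names `replace_ignore_case`, `BR_VARIANTS` and `strip_br` are invented here, and the sketch assumes lowercasing does not change the string's length, which holds for ASCII markup.

```python
def replace_ignore_case(text, find, replacement):
    """Case-insensitive replace without regex (stand-in for a text-compare Replace)."""
    lowered = text.lower()   # lowercase once and reuse for every search
    needle = find.lower()
    out = []
    pos = 0
    while True:
        hit = lowered.find(needle, pos)
        if hit == -1:
            out.append(text[pos:])
            break
        out.append(text[pos:hit])
        out.append(replacement)
        pos = hit + len(needle)
    return "".join(out)


# The most likely spellings, kept in one place.
BR_VARIANTS = ["<br>", "<br/>", "<br />"]


def strip_br(html, replacement="\n"):
    # No "does it contain a <br>?" test first: the replace is the search.
    for variant in BR_VARIANTS:
        html = replace_ignore_case(html, variant, replacement)
    return html


variant_result = strip_br("a<BR>b<Br/>c<bR />d")
```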