• whicker (unregistered)

    I really despise these evangelical articles about regex. the original code did what it was supposed to do given the situation.

    This site is becoming the daily Whoops No Regex. (The Daily WNR)

  • Pelle (unregistered) in reply to Fuji
    Fuji :
    It could, but the eggs would always look like they are scrambled.
    Ha ha ha :-) Good one.
  • Jamie (unregistered) in reply to Pidgeot
    Pidgeot:
    XHTML is mostly useless in this day and age. Unless there's a *specific* need for something in XHTML (hint: there rarely is), it makes more sense to stick with HTML and use
    .

    And there we have it. A WTF winnah.

  • Pelle (unregistered) in reply to Tukaro
    Tukaro:
    You know, I'm fairly certain that Regular Expressions could cook breakfast for me if I could figure out the right sequence.
    Actually, regular expressions are quite limited. For example, you cannot use them to match x^n y^n (an equal number of x's followed by y's).
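Pelle's limitation can be shown with a small sketch, written in Python here rather than VB. The name `balanced_xy` is made up for the illustration. A classic regular expression such as `x*y*` can check the shape of the string but not that the two counts agree, while a few lines of ordinary code can.

```python
import re

# Shape only: any number of 'x' followed by any number of 'y'.
SHAPE = re.compile(r"x*y*")

def balanced_xy(s: str) -> bool:
    """Return True if s is n copies of 'x' followed by n copies of 'y'."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "x" * n + "y" * n

for sample in ("xxyy", "xxy", "xyxy"):
    # Prints the sample, whether the shape regex accepts it, and whether the counts agree
    print(sample, bool(SHAPE.fullmatch(sample)), balanced_xy(sample))
```

Note that "xxy" passes the shape check even though the counts differ, which is exactly the gap Pelle describes.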
  • (cs) in reply to whicker
    whicker:
    I really despise these evangelical articles about regex. the original code did what it was supposed to do given the situation.

    This site is becoming the daily Whoops No Regex. (The Daily WNR)

    Awww... it's ok. Other people didn't learn RegEx either, so you're not alone! :)

    -- Seejay (who did learn RegEx at her WTF University... at least they did something right there!)

  • (cs) in reply to Pelle
    Pelle:
    Tukaro:
    You know, I'm fairly certain that Regular Expressions could cook breakfast for me if I could figure out the right sequence.
    Actually, regular expressions are quite limited. For example, you cannot use them to match x^n y^n (an equal number of x's followed by y's).
    Seriously, people who don't know that wouldn't be posting here. And if they are then they shouldn't be. I mean, seriously. This is pretty fundamental parsing.
  • Anon Fred (unregistered) in reply to Welbog
    Welbog:
    Seriously, people who don't know that wouldn't be posting here. And if they are then they shouldn't be. I mean, seriously. This is pretty fundamental parsing.

    Tedshade? Is that you?

  • ChiefCrazyTalk (unregistered) in reply to Tukaro

    That's not a WTF; the poor guy probably just didn't know regular expressions. Hopefully he learned something new.

  • Paul (unregistered) in reply to Gamen
    Gamen:
    As for a regex that would match most <br> tags, /<br[^>]*>/i

    That's the closest one so far. Amazing how many people complain about Mickey's version and don't get it much better themselves.

    It still won't handle
    but the chances of that happening are slim (who uses titles for <br> tags? id, clear, style and class attributes won't normally have '>' characters in their values).

    However

    <br([^'">]"[^"]"|[^'">]'[^']')[^'">]>/i

    is closer AFAICS

  • UTU (unregistered)
    XIU:
    Well I think "<br\s*/?>" would probably do it on most sites.
    Geoff:
    <\s*br\s*/\s*>
    

    And we can just hope that the regexp engine isn't running in greedy mode by default :)

    How about:

    <\s*?br\s*?[^>]*?>
    
  • UTU (unregistered) in reply to Paul
    Paul:
    It still won't handle
    but the chances of that happening are slim (who uses titles for <br> tags? id, clear, style and class attributes won't normally have '>' characters in their values).

    You're not even allowed to have unescaped special characters within tags if memory serves? Of course... there's HTML and there's tagsoup.

  • (cs) in reply to UTU

    This post is an example of why I love regexp threads so much. (Forum won't let me quote it right.)

  • Ubersoldat (unregistered) in reply to whicker
    whicker:
    I really despise these evangelical articles about regex. the original code did what it was supposed to do given the situation.

    This site is becoming the daily Whoops No Regex. (The Daily WNR)

    Listen to me carefully: if for whatever reason I ever have to debug or read VB code (or any other) and find that kind of crap, I will slap someone to his/her knees with the 100 pages of pain of the Regular Expressions Pocket Reference book. Or even better, I would download it to a hard drive and go all postal with it: "there, learn regexps while you bleed!"

    Brain Waffles with a hard drive.

  • VB man (unregistered)

    What type of VB are we talking -

    startBr = InStr(searchFrom, LCase(html), "<br")
    endBr = InStr(startBr, html, "/>")
    ' "/>" is two characters long, so the rest of the text starts at endBr + 2
    html = Left(html, startBr - 1) + vbCrLf + Mid(html, endBr + 2)

    also what about

    ?

  • html (unregistered)

    You guys are all the real WTF. Using regex to parse xml/html is just wrong. What if there's a <br> embedded in an attribute that needs to be preserved? The correct solution is to use a real parser.
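A rough idea of what "use a real parser" could look like, sketched in Python with the standard library's `html.parser` rather than anything VB offers. The class name `BrToNewline` and the helper `strip_tags` are invented for this example, and the sketch assumes the goal is plain text with a newline wherever a br element stood.

```python
from html.parser import HTMLParser

class BrToNewline(HTMLParser):
    """Collect text content, emitting a newline for each br element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags are routed here too
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str) -> str:
    parser = BrToNewline()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

# The "<br>" inside the title attribute is attribute text, not an element
sample = '<p title="a<br>b">Hello<BR/>World<br >!</p>'
print(repr(strip_tags(sample)))
```

Because the parser tracks quoting, the text inside the attribute value never reaches `handle_starttag` as a tag of its own, which is the case html is worried about.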

  • John Doe (unregistered) in reply to Gamen
    Gamen:
    As for a regex that would match most <br> tags, /<br[^>]*>/i

    No, this replaces all tags which start with 'br', so also stuff like <break>, <bright>, <brillant>, etc. Better would be: /<br(\s[^>]*)?>/i which requires at least one whitespace character after 'br' whenever attributes or other junk follow in the tag.

    Of course this still wouldn't work if you use prefixes, like html:br/. Good luck with solving that with regexes ;)
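The difference between the two patterns is easy to check by hand. Here is a quick comparison in Python (chosen only because it is easy to run; the sample tags are made up):

```python
import re

loose = re.compile(r"<br[^>]*>", re.IGNORECASE)          # Gamen's pattern
stricter = re.compile(r"<br(\s[^>]*)?>", re.IGNORECASE)  # John Doe's pattern

samples = ["<br>", "<BR clear='all'>", "<br />", "<br/>", "<break>", "<brillant>"]
for tag in samples:
    print(f"{tag:20} loose={bool(loose.fullmatch(tag))} "
          f"stricter={bool(stricter.fullmatch(tag))}")
```

The stricter pattern rejects `<break>` and `<brillant>`, but it also rejects `<br/>` written without a space, so neither pattern covers every spelling.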

  • Anon Fred (unregistered)

    I hope no one intends to use their own HTML tag, like <brisket>, because most of the regexes that people have posted so far would match it.

    Or if Netscape comes out with the <brown> tag, you're all dead.

  • Jonathan (unregistered) in reply to Tukaro

    He's lucky that the tag's name isn't longer than 2 characters...

  • Shinobu (unregistered)

    So what have we learned today? To reliably match BR-tags with a regex you need a regex that looks like line noise, that you don't really type or read like normal code but more like a sort of puzzle, that's very hard to check for correctness, and almost certainly misses a corner case somewhere.

  • wtf (unregistered)

    i guess that RegEx.Replace returns the result, doesn't edit it in place (like in a byref parameter)

    so his "solution" is still broken :p
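The same pitfall exists outside VB. As an analogy only (this is Python, not the code from the article), `re.sub` returns a new string and leaves its argument untouched, so the result has to be assigned back:

```python
import re

text = "line one<br>line two"

# Result discarded: text is not modified
re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
unchanged = text

# Result kept: assign the return value back
text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)

print(repr(unchanged))
print(repr(text))
```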

  • Kid (unregistered)

    Regex would probably be the best choice in this scenario, but IMO it's far too widely used when there are better (and faster) choices. I blame Perl.

  • wtf (unregistered) in reply to Paul
    Paul:
    Gamen:
    As for a regex that would match most <br> tags, /<br[^>]*>/i

    That's the closest one so far. Amazing how many people complain about Mickey's version and don't get it much better themselves.

    It still won't handle
    but the chances of that happening are slim (who uses titles for <br> tags? id, clear, style and class attributes won't normally have '>' characters in their values).

    However

    <br([^'">]"[^"]"|[^'">]'[^']')[^'">]>/i

    is closer AFAICS

    eh... no dice again... a major flaw of this regex is that it will match <breakfast> and such. i say, don't try over-engineering things... br doesn't need attributes, so don't bother supporting them.

  • Shill (unregistered) in reply to John Doe
    John Doe:
    Gamen:
    As for a regex that would match most <br> tags, /<br[^>]*>/i

    No, this replaces all tags which start with 'br', so also stuff like <break>, <bright>, <brillant>, etc. Better would be: /<br(\s[^>]*)?>/i which requires at least one whitespace character after 'br' whenever attributes or other junk follow in the tag.

    Fails to detect
    .

  • Steve H. (unregistered)

    I only have 2 comments to make about this.

    1. I'd definitely be guilty of writing code like this. Yes, I learned some Regular Expressions, but I haven't used them in years, since not too many embedded systems use them. The question should be asked: What sort of code would you have written had you forgotten that regular expressions were an option?

    2. Judging from some of the comments I'm reading here, I'm not convinced that everyone else here understands regular expressions either. They're a good way to get things done, but it's also damn easy to shoot yourself in the foot hard.

  • pb (unregistered)

    I would say that the real WTF is that this isn't even a good way to do it without regex.

    a) DOM exists for a reason (not that it is the most performant solution).

    b) If DOM is banned, why not parse it LToR, break out into a tagname check on encountering "<"?

    The most interesting part is this:

    c) If, for some mysterious reason (use of DOM, regex and LToR parsing all banned), this really is the way it has to be done, Why 3 LCase() calls? Why not toLower the whole thing, store it, then do the three tests with that? I suppose that the extra two scans through the whole string must be pretty insignificant when faced with all the various

    etc. tests.

    This points to the coder being essentially an untrained scripter who doesn't understand that the machine (on the whole) has to do as it is told, and that there is an impact to adding extraneous commands.

    However, if that is the case, Why bother with the three tests at all? They can only be there to "improve performance" - Why not just attempt the replaces anyway? Unless there is some significant chance that many of the input documents have no br tags at all, the performance improvement gained by the tests can't be that significant.

    So, I expect that the tests went in, because, after slowness complaints by a punter, one of his superiors took a look at the code and said "that function looks pretty intensive, perhaps you should put a test in, so it doesn't always do all those replacements".

    Don't blame the fresh-out-of-nappies grunt coder for this WTF, blame his Peter Principle boss, who got "promoted" away from the code to somewhere where he couldn't do any harm.
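Option (b) above, a left-to-right scan that checks the tag name on meeting "<", might look roughly like this. It is sketched in Python rather than VB, the function name `replace_br` is invented, and it assumes no attribute value contains a ">" character.

```python
def replace_br(html: str, newline: str = "\r\n") -> str:
    """Scan left to right; on '<', check whether the tag name is 'br'."""
    out = []
    i = 0
    while i < len(html):
        if html[i] == "<":
            end = html.find(">", i)
            if end != -1:
                # Drop surrounding space and a trailing '/', then take the first word
                inner = html[i + 1:end].strip().rstrip("/").strip().lower()
                words = inner.split()
                name = words[0] if words else ""
                if name == "br":
                    out.append(newline)
                    i = end + 1
                    continue
        # Not a br tag: copy the character through unchanged
        out.append(html[i])
        i += 1
    return "".join(out)

print(repr(replace_br("a<br>b<BR />c<break>d<br/>e", "\n")))
```

The input is lowercased only inside the tag being inspected, so the text outside tags keeps its original case and the string is walked once.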

  • Fish Basket Gordo (unregistered) in reply to bstorer
    bstorer:
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs. This looks more like a task one would assign to a junior developer, and provide some guidance to them, or a suggestion of what to use. Now, if you were to tell me this was done by a senior engineer, or provided some history of grandiose accomplishments from the perpetrator of this submission, perhaps I would be more awestruck by its "wtf"-osity. As it stands, it just appears to be an individual unaware of one of the many tools available to a developer.
    What CS program did you have that didn't include regex? We had to learn the language theory and design our own regex engine.

    Indeed. I had to code a regex engine, along with push down automaton engine and Turing machine engine for my comp sci degree. Besides, isn't the root of many, if not most, WTFs a developer who is ignorant of common tools?

  • TANJ (unregistered) in reply to ahnfelt
    ahnfelt:
    Pez:
    bstorer:
    strTagLess = Replace(strTagLess, "
    ", vbCrLf)

    If anybody used
    around me, I'd shoot 'em.

    If anyone used anything other than
    around me, I'd shoot them...

    Captcha: atari - Old Skool!

    If anyone used anything other than
    around me, I'd shoot them.

    Would that be a BReak dance? :-)

  • GrandmasterB (unregistered) in reply to html
    html:
    You guys are all the real WTF. Using regex to parse xml/html is just wrong. What if there's a
    embedded in an attribute that needs to be preserved? The correct solution is to use a real parser.

    Good point Grasshoppa.

    What if the text is "All are invited <Brunch is at noon>" or "Use
    to show a line break". In the end it'd depend on what the data being parsed looked like.

  • the Winner is... (unregistered) in reply to Shill

    /<\s*br([^\w>][^>]*)?>/i

    And with quoted > allowed:

    /<\sbr(^\w>'")?>/i

    Captcha: ewww... Exactly what this last regexp looks like.

  • (cs) in reply to Anon Fred
    Anon Fred:
    bstorer:
    What CS program did you have that didn't include regex? We had to learn the language theory and design our own regex engine.
    I know they're a bit of an outlier, but you can get a CS degree from MIT without taking the compilers class where they go over the theory of regexp's.

    For that matter, you can also get through without learning C or Ruby or PHP or Perl or Python or JavaScript.

    (Most students learned about them on their own time. It's the ones that didn't that you really need to watch out for.)

    Now that's a WTF. My CS program made Organization of Programming Languages a required class, covering regex (and grammars, and state automata, and so forth), basics of parsing and compiling, call stacks, multi-threading, blah, blah, blah. And then we had to go and implement it in a collection of various languages.

  • (cs) in reply to cBradley
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs. This looks more like a task one would assign to a junior developer, and provide some guidance to them, or a suggestion of what to use.

    I really find it hard to believe that a legitimate CS curriculum would not cover regular expressions at some point. Granted, the only language-specific class I encountered that covered regular expressions in any detail was an elective "intro to perl" class.

    However, I had a required "theory of computation" class during my senior year that thoroughly covered regular languages and the equivalence of DFAs, NFAs and Regular Expressions. It also went far beyond that into the realm of context-free languages and Turing machines. After that, I had a "programming languages" class where we had to write an interpreter for a language the professor made up. The first step was to break the code up into tokens (the scanner/lexer). Without using regular expressions, or at least emulating a DFA in code (that's what I wound up doing for reasons that I can't remember), that would have been pretty damn hard. I would think that most, if not all, CS programs require their students to write a compiler/interpreter at some point. That would seem really hard to do if you didn't have an understanding of "regular languages" on some level.

    If we're talking about a 2-year associate's degree in "programming" or something, then I could see them not covering regular languages or regular expressions at all... If somebody got through a 4-year university CS program without being exposed to them, then I think they got cheated.

  • Salty (unregistered) in reply to Michael
    Michael:
    Salty:
    Pyro:
    why bother fixing? using VB is the real WTF anyway :)

    That opinion is so 1990's.

    So is VB.
    It's much more modern than C++.

  • (cs) in reply to cBradley
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs. This looks more like a task one would assign to a junior developer, and provide some guidance to them, or a suggestion of what to use. Now, if you were to tell me this was done by a senior engineer, or provided some history of grandiose accomplishments from the perpetrator of this submission, perhaps I would be more awestruck by its "wtf"-osity. As it stands, it just appears to be an individual unaware of one of the many tools available to a developer.
    <rant>

    CS/CE majors who haven't figured out regex's, various unix utilities, and other "tools of the trade" that aren't taught in class are a big enough WTF on their own. I definitely agree that it is unremarkable that a typical CS major would not know regex's, however I think that one who doesn't take the time to learn such things will become a quite unremarkable developer who will spend many years writing code drenched in WTFery. For the record, regex's were covered in several CS courses that I took.

    On a slightly unrelated note, I found during my senior year that I knew a considerable number of people who were ready to graduate with CS degrees and pretty good GPA's who didn't know what pointers were, and in fact knew absolutely nothing about memory management. I think this is thanks to Java being the only practical language taught in many CS curriculums today. I think everyone should start with C, move to C++, and then once the fundamentals of programming and good practice have become ingrained, students should be allowed to use Java or other managed languages to simplify their lives.

    </ RaNt>

  • Jerome (unregistered) in reply to Tukaro
    Tukaro:
    You know, I'm fairly certain that Regular Expressions could cook breakfast for me if I could figure out the right sequence.

    Sure, but if you used a greedy regex it would eat it too.

  • (cs) in reply to Troche
    Troche:
    One of the things I have run into is that, though it may be taught in CS programs it wasn't taught in my program. My degree is in CIS(Computer Information Systems), throwing that information systems in there evidently gave them license to just skip huge sections of my education.

    CIS is a completely different degree than Computer Science. At most universities, it's more business-oriented than technically oriented.

    My school's version of this was called "MIS" and I majored in it for two years before switching to CS. It was far too easy for my tastes and I didn't feel like I got much out of it, hence the switch.

  • (cs) in reply to XIU
    XIU:
    xtremezone:
    Though technically the browser probably wouldn't care what came between
    so the regular expression should probably account for any number of anything, assuming there is at least one whitespace character between
    . :-/




    Well I think "<br\s*/?>" would probably do it on most sites.

    <\sbr[^>]>

  • Gabriel (unregistered) in reply to dkf
    dkf:
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs.
    You mean there's a lot of places affiliated with WTF-U's programme? What are they teaching instead, underwater basket-weaving?

    At UC Irvine, we covered state machines, finite automata, and TOUCHED ON regular expressions in one of our MATH Classes... but didn't really have much coding coursework related to it. I am profoundly grateful for my part-time job at the time which introduced me to Perl and regular expressions way before my time. =)

    And... while regular expressions might cook you eggs in the morning, it might look as if the chicken were drunk. (er ... guess I'm mixing my metaphors there. ;))

  • (cs) in reply to bstorer
    bstorer:
    What CS program did you have that didn't include regex? We had to learn the language theory and design our own regex engine.
    And in the Games Programming course I took at uni, we had to write our own (from scratch): A* pathfinder, software 3D renderer, basic physics engine, particle system, and multiplayer network engine. All in C/C++.

    I'm not kidding. That was a very good course.

    We also learnt HTML/XHTML, perl, PHP, 68000 assembler, and low-level computer architecture (transistors FTW), along with appropriate maths (Matrices, quaternions etc).

    It was hard to find a good games programming course, most of the places I looked thought that Java or Flash was a good language to base a games programming course around. Those courses wouldn't have got me working with UE3 and an Xbox360 dev kit. Instead I would have been working in a group of maybe 5 people churning out "new" clones of old 2D games for mobile phones.

  • iMalc (unregistered)

    I don't suppose it was written by the same person that made all the #defines for cout cOut cOuT...

  • Gabriel (unregistered) in reply to Gabriel
    Gabriel:
    At UC Irvine, we covered state machines, finite automata, and TOUCHED ON regular expressions in one of our MATH Classes... but didn't really have much coding coursework related to it.

    I thought I'd expand on this.

    The material was "covered in some form" -- we just didn't have any required coursework that required us to DO much with it, except perhaps a homework assignment I might be forgetting. Our compilers class was optional, but even that I don't think had us using regular expressions. We did have to implement a token recognition framework, and my knowledge of regexes made understanding BNF/EBNF easier ... but it was never tied in class to the fact that there were language features out there already supporting it.

    Thank god for Perl, or I'd have had to learn regexes the hard way. ;)

  • Zygo (unregistered) in reply to Christian
    Christian:
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs.

    They're almost certainly taught in 99.999% of courses on compilers. Aren't compiler courses still taught in a majority of CS programs?

    Regexes are an obvious product of the study of DFA's, grammars, Turing machines and the like, so it would be a real WTF to have a CS course without regexes.

    In my CS program regexes were part of a mandatory 2nd year class (do basic RE's with ?, +, and * operators, implement an LL(1) and LR(1) parser, DFA theory, Turing machines, etc).

    Compilers were an optional 4th year class (the language to be compiled was Ada IIRC).

    I'm guessing that most of the technical colleges (the ones with trademarked product names in their course titles) barely teach CS theory at all, so it wouldn't surprise me to learn that they ignore regexps.

  • (cs)

    The real WTF is that to replace <br> in all the correct places, you need a fairly complete HTML parser; neither a string replace nor a regexp will do.

    The
    could be in a comment, some embedded code, or inside a or

     block.

  • (cs)

    Where did those goggles go ?

  • Zygo (unregistered) in reply to Tukaro
    Tukaro:
    You know, I'm fairly certain that Regular Expressions could cook breakfast for me if I could figure out the right sequence.

    If you place your breakfast on your CPU, then use the right combination of minimal-matching modifiers on a long non-matching string, then yes, regular expressions can cook your breakfast for you.

    For example, the regexp '(a+)*\d' with a 1GB string of 'B's raises my CPU temperature by 7 degrees and makes my CPU fan noisy enough to attract the attention of my co-workers in neighboring cubes (hi Mike!).
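For anyone who wants to warm their own breakfast, here is a small timing sketch in Python. It makes one assumption that differs from the post: the subject is a run of 'a' characters with no digit, since that is the input on which this pattern backtracks heavily in a backtracking engine such as Python's `re`. The helper name `time_failed_match` is invented.

```python
import re
import time

pattern = re.compile(r"(a+)*\d")

def time_failed_match(n: int) -> float:
    """Time a failing match of the pattern against n copies of 'a'."""
    subject = "a" * n  # no digit anywhere, so the match must fail
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed

# Keep n small: the work grows steeply with every extra 'a'
for n in (12, 16, 20):
    print(n, f"{time_failed_match(n):.4f}s")
```

The engine tries the many ways of splitting the run of 'a' between the inner and outer repetition before giving up, which is where the heat comes from.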

  • Robert S. Robbins (unregistered) in reply to bstorer

    Holy crap! You obviously did not go to community college like I did. I bet you had to learn a lot of math.

  • Zygo (unregistered) in reply to Paul
    Paul:
    Gamen:
    As for a regex that would match most <br> tags, /<br[^>]*>/i

    That's the closest one so far. Amazing how many people complain about Mickey's version and don't get it much better themselves.

    Most of the postings here are just trying to do more or less exactly what the original code did, but more concisely and with some really obvious flaws fixed.

    Of course the real WTF is that he's not using a full HTML parser library, which would get the syntax and grammar right, but probably also have unpredictable runtimes, buffer overrun bugs, miscellaneous security issues, many error cases to handle, much larger memory and CPU overheads, maybe even outrageous licensing fees, and at the end of the day probably spend the rest of its service life parsing inputs that consist only of very similar text with the occasional <br> tag.

    For all we know, this code might read from a corporate intranet's internal database web service, and the only possible tags it might ever see are exactly the set "
    " and "
    ".

    Paul:
    It still won't handle
    but the chances of that happening are slim

    My cell phone's web browser saw a web page with a line very similar to that once.

    It turned the phone into a brick.

    To be fair, the phone's browser would crash (automatically resetting the phone to power-on state) on any XHTML tag, so there were fairly craaazy firmware bugs to start with. That particular construct seemed to hit the triple-word-score square on the Scrabble board of code bugginess, and wrote something to flash memory that caused the same crash during the power-on sequence.

    One expensive firmware-reflashing (and upgrade) later, now the web browser pops up an error dialog ("out of memory") on that line of HTML, but doesn't crash the phone.

    But I digress.

    Shame! Shame on people for not writing full table-driven recursive parsers in their one-off forum postings. Shame!

  • (cs) in reply to dkf
    dkf:
    cBradley:
    As awesome as regular expressions are, they aren't taught in most Comp Sci programs.
    You mean there's a lot of places affiliated with WTF-U's programme? What are they teaching instead, underwater basket-weaving?

    At academic institutions, they focus on theories and knowledge.

    At vocational institutions, they focus on practice and experience.

    Most real-life institutions offer some mix of academic and vocational study.

  • AGould (unregistered) in reply to Pyro
    Pyro:
    why bother fixing? using VB is the real WTF anyway :)

    Alas, some of us don't get a choice. My work requires me to use VBA (usually Access) because it's the only tool we can use in-house. Anything more complicated requires head office approval, a mountain of specs, and they wouldn't let us build it anyway.

  • (cs) in reply to the Winner is...
    the Winner is...:
    /<\s*br([^\w>][^>]*)?>/i

    And with quoted > allowed:

    /<\sbr(^\w>'")?>/i

    Captcha: ewww... Exactly what this last regexp looks like.

    Which character class matches annoying captcha posts??

  • (cs)

    Even without RegExps there is a slightly nicer solution that would have been as reliable as the original code and doesn't use any special concepts. Simply looking up the usage of the Replace function would have revealed the vbTextCompare flag. Storing the most likely variations in an array also removes a few lines from the code. I could be wrong in my reasoning, but it seems a waste to search the string 3 times first for matches before searching again for the actual text replacement. It seems it might be more efficient to just call Replace.

    Dim i
    Dim strBrArray: strBrArray = Array("<br>", "<br/>", "<br />")
    For i = 0 To UBound(strBrArray)
        ' Start position is 1-based; -1 replaces every occurrence
        strTagLess = Replace(strTagLess, strBrArray(i), vbCrLf, 1, -1, vbTextCompare)
    Next

    The above code is not guaranteed to work, nor is it guaranteed to be free of WTFs. :P

    Addendum (2007-10-31 16:54): Note: my code snippet is VBScript and the original appears to have been VB. :P There aren't too many differences, though there are some. It doesn't really change the solution much, however.
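For comparison, the same idea (a short list of literal spellings replaced case-insensitively) can be sketched in Python. The list `BR_VARIANTS` and the function `replace_variants` are hypothetical names, and the three spellings are an assumption about which variants matter.

```python
import re

BR_VARIANTS = ["<br>", "<br/>", "<br />"]  # assumed set of spellings

def replace_variants(text: str, newline: str = "\r\n") -> str:
    for variant in BR_VARIANTS:
        # re.escape keeps the variant literal; IGNORECASE plays the role
        # that vbTextCompare plays in the VBScript version.
        # The lambda returns the newline as-is, with no escape processing.
        text = re.sub(re.escape(variant), lambda m: newline, text,
                      flags=re.IGNORECASE)
    return text

print(repr(replace_variants("a<BR>b<br/>c<Br />d", "\n")))
```

Like the VBScript version, this makes one pass per variant and skips the separate "is it there?" tests.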

Leave a comment on “Breaking Broken”
