• Will Perdikakis (unregistered)

    I figured out how to make addition work in C.

    // This solves A + B.
    int add(int A, int B) {
        if (A == 1) {
            if (B == 0) return 1;
            if (B == 1) return 2;
            if (B == 2) return 3;
            if (B == 3) return 4;
            if (B == 4) return 5;
            if (B == 5) return 6;
            if (B == 6) return 7;
            if (B == 7) return 8;
            if (B == 8) return 9;
            ...
        }
    }

    Feel free to use the above code. I am releasing it under GPL as soon as it is approved.

    Thank you.

  • baron5038 (unregistered) in reply to ObiWayneKenobi
    ObiWayneKenobi:
    Oh good sweet god. Love how it replaces the tags.

    And.. I don't get the deal with the Exceptions. It's bad to throw an exception in a try/catch block? But it's even worse to go: throw new Exception? So... what the hell do you do with the exception? I mean, isn't the proper error handling to throw it, so the calling method (in the presentation layer) can also implement try/catch and then display the error? You can't display the error in the business logic or data logic layers, so what more can you do except throw the exception?

    If that's what you want to do, then don't catch the exception in that function at all in the first place.

    Data Access Layer --> throws exception
    Business Logic Layer --> no try/catch whatsoever
    UI Layer --> catches exception and displays error message to user

    If a function needs to do some cleaning up in case an exception tears through it, then use try/finally with no catch. If you want to do something with the exception itself, but still rethrow it e.g. to a UI layer, use 'throw' by itself to preserve the stack trace.
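
    For what it's worth, here is a minimal sketch of that layering in Python (the advice above reads as C#-flavored, so treat this as an analogue: Python's bare "raise" plays the role of a bare "throw", and the layer and exception names are made up purely for illustration):

    class DataAccessError(Exception):
        """Hypothetical exception type, for illustration only."""
        pass

    def data_access_layer():
        # Data Access Layer --> throws
        raise DataAccessError("connection refused")

    def business_logic_layer(log):
        # Business Logic Layer --> normally no catch at all; the except below
        # exists only to show the bare re-raise that keeps the stack trace.
        try:
            return data_access_layer()
        except DataAccessError:
            log.append("noted the failure")
            raise                             # bare raise: rethrow, trace intact
        finally:
            log.append("cleaned up")          # runs even as the exception tears through

    def ui_layer():
        # UI Layer --> catches and shows something friendly to the user
        log = []
        try:
            business_logic_layer(log)
        except DataAccessError as exc:
            print("Sorry, something went wrong:", exc)
        print(log)                            # ['noted the failure', 'cleaned up']

    ui_layer()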

  • (cs)

    In a slightly non-standard dialect of RegExp (in that capture groups are identified by {}, \s indicates any whitespace character, and angle brackets must be escaped by backslashes), I have tested this: \<{[^\s\>]+}\>|\<{[^\s\>]+}[^\>]+\>

    If you lowercase the capture group on each match, it works.

    Even for code like:

    [image]
    

    Sure, it fails in a few remaining corner-cases:

     

    Of course, I don't expect that attributes with a ">" followed by a "<" occur very often in regular usage.

    I also believe that (while entirely possible) extending a Regular Expression to catch valid HTML attributes and filter them out would be excessive.
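
    For reference, the same idea in stock Python re syntax (the {} capture groups become ordinary parentheses and the angle brackets need no escaping) might look roughly like this -- a sketch, not a drop-in:

    import re

    # Same two alternatives as the expression above: a bare tag, or a tag
    # followed by attributes; only the tag name is captured.
    TAG = re.compile(r'<([^\s>]+)>|<([^\s>]+)[^>]+>')

    def lowercase_tags(html):
        def repl(m):
            name = m.group(1) or m.group(2)
            # The first occurrence of the captured name inside the match is the
            # tag name itself (right after '<'), so lowercase only that.
            return m.group(0).replace(name, name.lower(), 1)
        return TAG.sub(repl, html)

    print(lowercase_tags('<P CLASS="Intro">Hello <A HREF="x.html">world</A></P>'))
    # -> <p CLASS="Intro">Hello <a HREF="x.html">world</a></p>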

  • (cs) in reply to anon
    anon:
    Dan:
    I think this is by far the biggest WTF out of this whole topic.

    Actually I think the bigger WTF is that you (and others) are taking such an obvious troll seriously.

    (come on - Perl, "line noise", scrabble tiles, I couldn't have been much more obvious)

    It never ceases to amuse me how most of the commenters on this site can't spot deadpan sarcasm, even though it's the "house tone" of the entire site

    I for one laughed so hard I cried....thank you.

  • (cs) in reply to Chris D
    Chris D:
    I also think a few people are missing that this isn't the HTML parser. This is just a hack preprocessor for what we can only assume is an awful HTML parser.

    Regexs are difficult to use correctly for parsing HTML because of the greedy nature of most regexs. It's not impossible .. but hardly the easiest way.

    Greedy: *
    Not greedy: *?

    Greed is definitely not the problem here.
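
    For anyone following along, the difference in two lines of Python (illustrative only):

    import re

    html = '<b>bold</b> and <i>italic</i>'
    print(re.findall(r'<.*>', html))    # greedy:    ['<b>bold</b> and <i>italic</i>']
    print(re.findall(r'<.*?>', html))   # reluctant: ['<b>', '</b>', '<i>', '</i>']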

  • Dan (unregistered) in reply to anon
    anon:
    Dan:
    I think this is by far the biggest WTF out of this whole topic.

    Actually I think the bigger WTF is that you (and others) are taking such an obvious troll seriously.

    (come on - Perl, "line noise", scrabble tiles, I couldn't have been much more obvious)

    It never ceases to amuse me how most of the commenters on this site can't spot deadpan sarcasm, even though it's the "house tone" of the entire site

    ha.. yeah, admittedly, I didn't read the whole comment (missed the scrabble part). But reading back over it, the sarcasm is fairly obvious.

    I hadn't had my morning coffee yet, give me a break :)

  • PS (unregistered)

    Do the Html tags get a happy ending after their massage?

  • baron5038 (unregistered) in reply to savar
    savar:
    No, most HTML parsers are written with state machines. Regexes are totally inappropriate for most HTML operations.

    Regex handlers are (finite) state machines.

    An HTML parser would (probably, I think) be a push-down automaton (effectively a finite state machine with a stack). But the HTML would have to be perfect. Any program which has to parse HTML robustly would be a horrible mess of special cases and ugliness.

  • (cs) in reply to grg
    grg:
    Ahem, I haven't seen anybody mentioning this so far, but IMHO the bigger WTF is thinking you can parse HTML with regular expressions.

    You were kidding, right? There are about 40 posts preceding yours on that exact topic.

  • (cs) in reply to baron5038
    baron5038:
    If that's what you want to do, then don't catch the exception in that function at all in the first place.

    Data Access Layer --> throws exception
    Business Logic Layer --> no try/catch whatsoever
    UI Layer --> catches exception and displays error message to user

    If a function needs to do some cleaning up in case an exception tears through it, then use try/finally with no catch. If you want to do something with the exception itself, but still rethrow it e.g. to a UI layer, use 'throw' by itself to preserve the stack trace.

    Okay.. so it's not so much the catching of the Exception as the method where it was done. Gotcha. That makes sense. Thanks!

  • Stof (unregistered) in reply to vertagano
    vertagano:
    In a slightly non-standard dialect of RegExp (in that capture groups are identified by {}, \s indicates any whitespace character, and angle brackets must be escaped by backslashes), I have tested this: \<{[^\s\>]+}\>|\<{[^\s\>]+}[^\>]+\>

    If you lowercase the capture group on each match, it works.

    Even for code like:

    [image]
    

    Sure, it fails in a few remaining corner-cases:

     

    Of course, I don't expect that attributes with a ">" followed by a "<" occur very often in regular usage.

    I also believe that (while entirely possible) extending a Regular Expression to catch valid HTML attributes and filter them out would be excessive.

    But this is the WTF site. Saying "it shouldn't fail often" is considered a bad thing isn't it? :)

  • Rich (unregistered)
        Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
    
        -- Jamie Zawinski 
    

    Spoken in response to a request on how to use RegEx'es for HTML.

  • DoOver (unregistered) in reply to skington

    skington:
    Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.

    Try to write a regexp to match the content of a tag inside a

  • RON (unregistered) in reply to savar
    savar:
    The real WTF (besides having to scroll to the bottom to reply to the first post) is this:
    He wanted to learn more about regular expressions and, realizing that all HTML tools are probably written with regular expressions, decided to dig into one.
    No, most HTML parsers are written with state machines. Regexes are totally inappropriate for most HTML operations.

    In the given example Regex is okay because you're really doing a blind find-and-replace operation without worrying about the validity of the HTML.

    Still, I would wager money that the "fix" to this code involves some WTFs that would be totally abused by malformed HTML.

    A regular expression IS a state machine. Regular languages are 1-to-1 convertible from NFA's, DFA's, and RegExp's.

  • RON (unregistered) in reply to Rich
    Rich:
        Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
    
    -- Jamie Zawinski 
    

    Spoken in response to a request on how to use RegEx'es for HTML.

    I've heard that quote before. Except it wasn't with regular expressions, it was with XML.

    Anyone who thinks using a regexp is a problem is probably using it for something it's not intended for.

  • (cs) in reply to RON
    RON:

    A regular expression IS a state machine. Regular languages are 1-to-1 convertible from NFA's, DFA's, and RegExp's.

    If by that you mean you can do anything with a regexp - well yes. But you can equally do anything with brainfuck. Just because you can do a certain job in any given language doesn't mean you should.

  • Opie (unregistered) in reply to Stof
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    No. Any proper xml parser would barf at that one, too, for precisely that reason: > is a reserved character and must be escaped in the string. So the fact that the regexp processor would fail is not limited solely to regexps. Doing it via a character by character string match would also fail unless your parser craps on the standards for parsing of HTML.

  • Opie (unregistered) in reply to Opie

    Also, there's no reason an XHTML 1.1-compliant parser could not use regular expressions and a stack to push and pop the most recent tags to/from.

    It would require that the document be perfect, but then again XHTML 1.1 requires the document to be perfect already, so what's the problem?

    It would actually probably be far more efficient and is probably (by my reckoning) one of the reasons behind that part of the standard.

  • RON (unregistered) in reply to m0ffx
    m0ffx:
    RON:

    A regular expression IS a state machine. Regular languages are 1-to-1 convertible from NFA's, DFA's, and RegExp's.

    If by that you mean you can do anything with a regexp - well yes. But you can equally do anything with brainfuck. Just because you can do a certain job in any given language doesn't mean you should.

    You're missing the point. A regular expression literally is a state machine. They were designed to describe regular languages. In fact every Regex implementation I've ever seen compiles it down to a state machine, so the statement that it's better to use a state machine than a regex for parsing HTML is purely nonsensical.

  • Opie (unregistered) in reply to RON
    RON:
    You're missing the point. A regular expression literally is a state machine. They were designed to describe regular languages. In fact every Regex implementation I've ever seen compiles it down to a state machine, so the statement that it's better to use a state machine than a regex for parsing HTML is purely nonsensical.

    EXACTLY. Thank you...

    If XML isn't a regular language, I don't know what is... Just because people write crappy HTML doesn't mean the parser is wrong. There is a reason there are HTML designer tools - it's to eliminate user error. If someone writes a bad HTML document because they did it manually and made a typo, it's their own damn fault. The programmer of the parser should not be responsible for errors made by an arrogant HTML "coder" who feels he's too good for WYSIWYG.

    If you write good HTML in notepad, then more power to you. Just don't complain when a parser can't read a document you created because of a misplaced quote, tag, or brace.

    It's a perfect example of knowing enough to be dangerous. Use the tool - it'll save you time, plus they all let you screw with the raw html anyway...

  • (cs)

    Bloody hell, people. I'm no regex expert, but even I know that you DO NOT use regexes to parse HTML. Every informed opinion I've read on the subject has been exceedingly clear on this point.

  • kanna (unregistered)

    I'm not saying using REGEX with HTML is good or bad, and it's not really something I would do (although, I do work with a specific case that I inherited where it's looking for a VERY specific line in an HTML document, not so much for parsing tags though). But this REGEX (I'm fairly new to using them) seems to get the first part of the tag into $1 and then the first quoted string into $2 and the rest in $3 (sorry, worked it out in Perl)... could probably be used in a higher-level language to create something. Maybe a recursive function to keep processing $3 until it's empty? I dunno. Just a starting point.

    /<([a-z0-9:=;/! ]+)"([^"]+)"([^>]+)>/i

  • kanna (unregistered) in reply to kanna
    kanna:
    I'm not saying using REGEX with HTML is good or bad, and it's not really something *I* would do (although, I do work with a specific case that I inherited where it's looking for a VERY ...

    Never mind, that was pretty sloppy. Oh well, that's what I get for trying to rush through in a programming language I only use rarely while trying to implement RegEx, which I don't use that often either.

  • (cs) in reply to Opie
    Opie:
    If XML isn't a regular language, I don't know what is...
    You don't know what a regular language is.

    The simplest non-regular language that we all know and love is the language of all palindromes. It is not regular because regular expressions cannot (at least in the truest sense) remember the past. Once a part of a string is read by a regular expression, it is lost and cannot be compared to anything later in the string. Since a palindrome requires the beginning of a string to be compared to the end, it cannot be regular. (Yes, I know this isn't a rigorous proof, but I don't want to go through the pumping lemma for this. A high-level explanation should be enough).

    Now think about well-formed XML. Each element has to be opened and then closed: <element>[inner stuff]</element>. Now think about this as a palindrome: <element> has to be read and then later compared to </element> for it to be valid. The state machine has to know that the first element was <element> in order for </element> to be valid. But the best a regular expression can do is know that there was a well-formed element of indeterminate name. And so, by this hand-wavey (and completely correct) argument, XML is not a regular language. It's context-free, like palindromes. You need a stack alongside a simple regular lexer in order to validate it.
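
    To make the "stack alongside a simple regular lexer" point concrete, here is a toy Python sketch (assumptions: tags are cleanly delimited, no comments, CDATA, or '>' inside attribute values -- it only shows where the stack supplies the memory a regular expression lacks):

    import re

    # The regex only lexes individual tags; the stack remembers what was opened.
    TAG = re.compile(r'<(/?)(\w+)[^>]*?(/?)>')

    def well_formed(xml):
        stack = []
        for closing, name, self_closing in TAG.findall(xml):
            if self_closing:
                continue                     # <br/>-style tag: nothing to balance
            if not closing:
                stack.append(name)           # opening tag: push
            elif not stack or stack.pop() != name:
                return False                 # closing tag must match the most recent opening
        return not stack                     # everything opened must also be closed

    print(well_formed('<a><b>text</b></a>'))  # True
    print(well_formed('<a><b>text</a></b>'))  # False: improper nesting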

  • Sigivald (unregistered) in reply to Rich

    Wasn't "Some people, when..." originally sed, specifically, not just regular expressions?

    Still true, of course.

  • htg (unregistered)

    Everybody is so excited about using regular expressions for this...

    Just iterate through the html text (the char []) with two booleans - inTag and inQuote. When you encounter '<' set inTag. If you are inTag and encounter a " or ', negate inQuote. When you encounter a '>' and inQuote==false, unset inTag. Otherwise lowercase the character if inTag && !inQuote. It's O(n) and you can do other things like build the DOM tree whilst iterating (with a little more work of course, but it is the logical thing to do at this stage).

    Or you could have an enum of parser states, because at some point you know you will have umpteen booleans and that's tatty. The above doesn't handle unquoted attributes, for example "< a href= images/onomatopoeia.png border =0>", where you need to have an inNonquotedValue when you hit a '=' until you hit a space, unless the space is after the = but before the value...

    It's more code, but it will probably end up more readable than a regular expression that handles all the retarded stuff people put in HTML, and you can do other work other than merely lowercasing tags and attributes.
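
    A rough Python rendering of that loop, just to show its shape (with the same simplifications noted above: ' and " are treated interchangeably, and unquoted attribute values are not handled):

    def lowercase_tags(html):
        out = []
        in_tag = in_quote = False
        for ch in html:
            if ch == '<' and not in_tag:
                in_tag = True                  # entering a tag
            elif in_tag and ch in ('"', "'"):
                in_quote = not in_quote        # entering/leaving a quoted value
            elif ch == '>' and in_tag and not in_quote:
                in_tag = False                 # leaving the tag
            elif in_tag and not in_quote:
                ch = ch.lower()                # lowercase tag/attribute names only
            out.append(ch)
        return ''.join(out)

    print(lowercase_tags('<A HREF="Page.HTML">Read Me</A>'))
    # -> <a href="Page.HTML">Read Me</a>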

  • (cs)

    Interestingly enough, the code provided in the OP will not translate "HTMl", "HTmL", "HtML" or "hTML" tags to lowercase, either.

  • (cs) in reply to Opie
    Opie:
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    No. Any proper xml parser would barf at that one, too, for precisely that reason: > is a reserved character and must be escaped in the string. So the fact that the regexp processor would fail is not limited solely to regexps. Doing it via a character by character string match would also fail unless your parser craps on the standards for parsing of HTML.

    As a fun fact, apparently escaping of < even inside attribute values is mandatory for exactly that reason. When designing XML, one of the minor design goals was to allow superficial hackish manipulation from the (quote) "desperate perl hacker".

  • Buga (unregistered) in reply to JL

    JL:
    Instead, you should either wrap the exception in a new exception and throw that, or just use the "throw" keyword on its own, which rethrows the exception without resetting its stack trace.

    Unless you are catching an exception to throw a more specific exception or do "something" with it, there is no reason to catch and re-throw, just let it bubble up.
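
    In Python terms (just an analogue of the C#-style advice quoted above, with made-up names): wrap only when you genuinely add information, and chain the original so nothing is lost; otherwise don't catch at all.

    class OrderLoadError(Exception):
        """Hypothetical, more specific exception -- illustration only."""

    def load_order(order_id, orders):
        try:
            return orders[order_id]
        except KeyError as exc:
            # We add context here, and "from exc" keeps the original exception
            # chained rather than discarding its traceback.
            raise OrderLoadError("order %r could not be loaded" % order_id) from exc

    # If there were nothing to add, the right move would be no try/except at all:
    # just let the KeyError bubble up to whichever layer can actually handle it.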

  • sjs (unregistered) in reply to Bob Janova
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    You fail at regexes.

    s/
      <                  # opening angle bracket
      (                  # capture the whole inside of the tag
        (?:
          "[^"]*" |      # double quoted stuff...
          '[^']*' |      # or single quoted stuff...
          [^'">]         # or other stuff
        )*
      )
      >                  # closing bracket
    /<\L$1>/xg           # put the brackets back; \L lowercases the inside (this lowercases all attrs as well, can be fixed)
    
    Bob Janova:
    I didn't know regexps could replace text with its lower case equivalent? Or am I missing the point here (i.e. the lower casing is only done at comparison time)?

    In Perl use \L in the replacement to lowercase all following chars, \l to lowercase only the next char.

    http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES

  • Abscissa (unregistered) in reply to anon
    anon:
    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintainable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    Heh, that's classic :). Though I have to admit I find regexes to be occasionally useful for fairly simple things. It's when you start getting to "the real power of regex" that it quickly turns to unmaintainable crap.

    A favorite quote of mine (although I wish I could remember where I saw it): "Perl - The only language that looks the same before and after encryption."

  • Andrew (unregistered) in reply to Stof
    Stof:
    Won't work by default. Regexps are greedy and such a pattern will match all the text between the first < and the last >

    It's easy to match distinct open & close tags. Match the open tag, consume anything that is not the close tag, and match the close tag i.e.: m/<[^>]*>/; Work on this tag string as needed.

    A common (PERL) WTF is using the non-greedy test (.*?) to find the nearest close tag: m/<.*?>/; These expressions can backtrack forever! If I get one person to stop using non-greedy tests, I'll have much happier, and faster, servers.

  • Jason Stein (unregistered) in reply to Welbog

    First time poster, but I'm pretty sure that only works with non-greedy regexes. You really need something like <[^>]+>.

    Sorry if this has been pointed out, but didn't read to the end.

  • Jason Stein (unregistered) in reply to Jason Stein

    I'm an idiot. See previous two posts.

  • Watson (unregistered) in reply to RON
    RON:
    A regular expression IS a state machine. Regular languages are 1-to-1 convertible from NFA's, DFA's, and RegExp's.
    Backreferences notwithstanding :)
  • Watson (unregistered) in reply to Opie
    Opie:
    No. Any proper xml parser would barf at that one, too, for precisely that reason: > is a reserved character and must be escaped in the string.
    No. That is a perfectly legitimate XML tag, attribute value, unescaped > and all. It's the < character that's verboten.

    I'd quote W3C chapter+verse here, but I'm not enough of a pedant to remember such things.

  • (cs) in reply to Cyberwizzard
    Cyberwizzard:
    First!

    Omfg, that's not even funny... How can someone who has obviously never heard of toLowerCase() (or its equivalents) write an HTML parser... I'd love to see what that thing does when I want to see its DOM tree... it does have a DOM tree right?.... I feel a headache coming up... sigh

    It's not that straightforward. A simple toLowerCase() will mess up the entire text of the document too - remember that at this point you still haven't actually isolated the tags from the content.

    I can't think of any way to solve this without using regex.

    Well, apart from the solution that was already implemented. If you can call that a solution...

  • Rob (unregistered) in reply to grg

    OK, I have to say, the real WTF is how few people in this thread seem to have any concept of what a regular expression actually is and how dynamically useful they are.

    Regular expressions haven't been REGULAR since the early 70s!!

    Technically, they are linear bounded automata that can match context-sensitive grammars, and if you were slick enough and had enough time (and used perl and the /x modifier to save your sanity) you could build an entire xml verifier. (You'd save some typing by pre-computing certain portions too.) Note that I said verifier because, no, you wouldn't necessarily be able to do a lot more with it than confirm that the xml was well formed, but you COULD do it.

    And just in case you're really that stupid: NO, you would never do that, because using regexes in this way is NP-hard and wouldn't be very efficient, but it would be possible. (This is probably why so few human languages exist that cannot be expressed using linguistically based CFGs; most people just can't handle more work than a simple stack.)

    AHH...my soapbox just collapsed...

  • esrever_otua (unregistered) in reply to Stof
    Stof:
    Welbog:
    I take it you missed it when I said reluctant.
    Not really, it's just that I didn't know that term.

    What about :

    <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    Your example isn't valid XML

  • Anonymous (unregistered) in reply to Strider
    Strider:
    < [A-Za-z0-9]+) I don't get it, what's the parenthesis for... and why the leading space? How about

    <.[^ ]*

    i.e., if you find an open bracket, ignore the first character and match all the way till you find a space... that's what I'm trying to do, don't know if this regex is what I just said... (not a regex pro)

    Try your suggestion on:

    This is just a test.
    

    You'll match "ust", which shouldn't be what you intend to match.

  • Anonymous (unregistered) in reply to grg
    grg:
    Ahem, I haven't seen anybody mentioning this so far, but IMHO the bigger WTF is thinking you can parse HTML with regular expressions.

    Yes and No.

    It would be complicated to implement a complete HTML parser solely in regexes. But regex is useful as a tool in implementing a part of the parser: the lexer. Ever heard of lex and yacc? (There are free clones called GNU flex and bison.)

    People do use lex and yacc to build compilers for PROGRAMMING LANGUAGES. Using these tools to build an HTML parser would be trivial for compiler writers. And guess what! "lex" code uses regular expressions! ("yacc" code uses BNF-like grammar expressions.)

    grg:
    That's basically impossible.

    You see, regular expressions are cryptic, but REGULAR.

    I don't think so. Well-written regexes are much more readable and maintainable than hundreds of lines of VB code that does the same thing.
    grg:
    And HTML isn't.

    For example you can embed almost anything into an HTML document -- JavaScript, PHP, and worse. Each one of those languages has its own syntax. Worse yet, that code tends to have quite a few HTML fragments in quotes of various kinds.

    To correctly parse HTML you need a lexical scanner that goes through the HTML from top to bottom, interpreting every token it sees, and doing the right thing depending on context.

    Yeah. If I need to write an HTML parser, I'll structure it that way: lexer, parser. But how would I code the lexer? I'd use regexes. Just like what people do using "lex".

    grg:
    Regular expressions just don't give that kind of flexibility. At least not as they're commonly used.

    I suppose you could scan the input one character at a time with a regular expression, but that would not be in any way fast, or clean, or probably reliable.

    In my lexer, I'd scan a whole tag at a time, exploiting regexes as much as possible, but not digging into the kind of complexity that would forgo readability (to a regex expert) and maintainability. I'd also scan a whole fragment of consecutive, untagged text without entity references at a time. So, I'd end up with token boundaries akin to SAX events. :)
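
    To sketch what such a regex-driven lexer might look like (a toy, not real lex/yacc; it ignores comments, CDATA, scripts, and entity references):

    import re

    # One token per tag or per run of untagged text, roughly like SAX events.
    TOKEN = re.compile(r'<[^>]+>|[^<]+')

    def tokenize(html):
        for match in TOKEN.finditer(html):
            chunk = match.group(0)
            yield ('tag', chunk) if chunk.startswith('<') else ('text', chunk)

    for kind, value in tokenize('<p class="x">Hello <b>world</b>!</p>'):
        print(kind, value)    # prints each token with its kind, in document order
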
  • (cs) in reply to ObiWayneKenobi
    ObiWayneKenobi:
    Oh good sweet god. Love how it replaces the tags.

    And.. I don't get the deal with the Exceptions. It's bad to throw an exception in a try/catch block? But it's even worse to go: throw new Exception? So... what the hell do you do with the exception? I mean, isn't the proper error handling to throw it, so the calling method (in the presentation layer) can also implement try/catch and then display the error? You can't display the error in the business logic or data logic layers, so what more can you do except throw the exception?

    If you can't handle the exception you should let it go. By catching it and throwing a new exception you eliminate the stack trace and some of the other information contained in the original exception. Unless that's your intent for whatever reason, you should just let it pass to the next layer until it reaches somewhere that can handle it. Your application should have a top level exception handler that handles fatal exceptions which can't be caught at any other level and gracefully terminates the application. For some errors that's the best you can do.

  • taschenorakel (unregistered) in reply to Dan
    Dan:
    Face it, regexps were never meant to parse xml :)
    Too bad for you that XML is specified using regular expressions. Regarding your "evil" example: although it is not valid XML, the following RE can deal with it:

    </?(\w+)((?:\s+\w+="[^"]*")*)\s*/?>

    as the following code snippet demonstrates:

    import re

    doc = """\
    <html>
     <body>
      foo
      <other with="some more attributes" class="foo" style="bar" />
      <other class="foo" style="bar">blubber</other>
      <tag value="value with a > just to mess with regexp parsing mechanisms" />
     </body>
    </html>"""

    for tag, atts in re.findall(r'</?(\w+)((?:\s+\w+="[^"]*")*)\s*/?>', doc):
        print '%s[%s]' % (tag, atts.strip())
  • (cs)
  • Stof (unregistered) in reply to esrever_otua
    esrever_otua:
    Your example isn't valid XML
    I see them forbid the < but not the > in that link. So my example stands.
  • celeriac (unregistered) in reply to esrever_otua
    esrever_otua:
    Stof:
    What about : <tag value="value with a > just to mess with regexp parsing mechanisms">

    Face it, regexps were never meant to parse xml :)

    Your example isn't valid XML

    Did you read the link you posted? It forbids "<", not ">".

  • Stealth Potato (unregistered) in reply to m0ffx
    m0ffx:
    RON:

    A regular expression IS a state machine. Regular languages are 1-to-1 convertible from NFA's, DFA's, and RegExp's.

    If by that you mean you can do anything with a regexp - well yes. But you can equally do anything with brainfuck. Just because you can do a certain job in any given language doesn't mean you should.

    No, that's not what he means. You can't do "anything" with regexps, since as RON pointed out they are equivalent to nondeterministic finite automata. An example of what you cannot do with any NFA is recognize a context-free language that is not a regular language. XML is not a regular language.

    I'm glad to see several other posters here who have a grasp on the theory. The rest of you need to go back to school. :-)

    And Rob: I'm aware that "regular expressions" as referred to by programmers are very different from the "regular expressions" referred to by computer scientists. I just happen to think they shouldn't have the nerve to call them "regular" anymore, since they're not. :-P

  • luper (unregistered) in reply to Stof

    Why are regexps unsuitable for xml parsing? They work.

  • (cs) in reply to Andrew
    Andrew:
    Stof:
    Won't work by default. Regexps are greedy and such a pattern will match all the text between the first < and the last >

    It's easy to match distinct open & close tags. Match the open tag, consume anything that is not the close tag, and match the close tag i.e.: m/<[^>]*>/; Work on this tag string as needed.

    A common (PERL) WTF is using the non-greedy test (.*?) to find the nearest close tag: m/<.*?>/; These expressions can backtrack forever! If I get one person to stop using non-greedy tests, I'll have much happier, and faster, servers.

    Can you elaborate on why reluctant operations take longer than greedy operations? I admit to not knowing how regexp algorithms work at the lowest level, but it seems to me that the naive approach to matching a regexp would be reluctant and greedy operations would be the harder case. The naive approach being just iterating through the string from lowest index to highest.

    Do Perl regexps go back to front or something?

  • dkf (unregistered) in reply to Welbog
    Welbog:
    Can you elaborate on why reluctant operations take longer than greedy operations? I admit to not knowing how regexp algorithms work at the lowest level, but it seems to me that the naive approach to matching a regexp would be reluctant and greedy operations would be the harder case. The naive approach being just iterating through the string from lowest index to highest.
    If you really want to know, read the source. (Note that mixed-mode RE engines are difficult to fully grok, especially with backreferences.) But the simple answer is that there's a bunch of trade-offs; make one case fast and the others become slower.

    Captcha: craaazy (ah, it's read Henry Spencer's RE engine code!)
