• austinjp (unregistered) in reply to Dan

    Actually, regexp by themselves provably can't handle things like parenthesis matching. Not even the best programmer can program a regex which will match nested parenthesis (i.e.: "(())()"). Though I think you're correct in that regex should be able to parse an html (or xml) tag (if the programmer is good enough).

    Parsing the entire [ht,x]ml tree can't possibly be done with regex, however, because of the possibility of nested elements.

  • Asd (unregistered)

    I'm going to be a little bit informative instead of joining in the flame war. If you have a dodgy HTML doc that you want to parse and you are using Java, then TagSoup could be just what you need. It behaves exactly like a standard SAX parser hiding all the nasty hacks. For general html cleanup of course Tidy/JTidy is great.

  • Rob (unregistered) in reply to austinjp

    You're wrong. As I stated before, the regular expressions as defined by Kleene (as in Kleene *) in the 1950s matched regular languages but as soon as computers came around, and particularly with the inclusion of backreferences in sed, awk, and perl, regexes became much more powerful.

    Larry Wall, the inventor of perl recently decided to only call them regexes to attempt to avoid this confusion.

    This quote comes from the perl documentation with "captured substrings" meaning backreferences like \1 in awk, $1 in perl:

    "There is no limit to the number of captured substrings that you may use."

    So in perl and any other tool that uses its engine (python, php, grep -P, pcregrep, etc), regexes would be Turing machines if you had infinite memory and Linear Bounded Automata. That means that YES they can match nested parenthesis and YES they can match nearly any arbitrary (decidable) language in the computational sense of the word (look it up)

    However, because they only operate on variables that contain exactly one value (cells in the computability terminology), and that are explicitly write-once, read forever, it would be very tedious to write regexes to parse many of these arbitrary languages.

    But not impossible, dammit. And in the meantime, any text processing you do in any language without regexes, a perl programmer can do faster (in terms of coding time), and a good perl programmer can do faster and without any hideously unreadable or inefficient regexen.

    So don't talk about "provably" impossible things when you don't know what you're talking about.

  • (cs) in reply to Rob

    I want to see a regular expression that can match balanced parentheses before I will believe this. I don't want to see a Perl script that can do it; I want to see a regular expression (or, "regex" I guess, to use your terminilogy).

    Provide me with an example of a regex matching a non-regular language, and you'll have won me over.

  • (cs) in reply to Welbog
    Welbog:
    I want to see a regular expression that can match balanced parentheses before I will believe this. I don't want to see a Perl script that can do it; I want to see a regular expression (or, "regex" I guess, to use your terminilogy).

    That's not what Rob's claiming; he's claiming that what he's calling a "regex" isn't a regular expression but a Turing Machine. (I'd be tempted to call it an n-PDA myself, since backreferences can be accessed out of order, but since n-PDAs and TMs are computationally equivalent that's just a quibble.)

    I don't think Rob phrased his argument well, but it doesn't contain any incorrect claims about the computing power of regular expressions or "regexes" - just a claim that the latter term refers to a different class of machine than the former.

    I take no stand on whether this terminological claim, or the ascription of it to Larry Wall, is correct. I do the dilution of the term "regex" (or "regexp", etc) is unfortunate, but here we are. Though I'll note that back before Larry decided to go all Dr Frankenstein and was still writing things like rn and patch, we used to call these things "EREs", for "extended regular expressions", which at least suggested that we weren't talking about Kleene's regular expressions.

    -- Michael Wojcik

  • Robert (unregistered)

    Companies think they can save $$$ by hiring cheap Indian and Chinese H1-Bs. They get this.

  • Rob (unregistered) in reply to Welbog

    #!/usr/bin/perl print $ARGV[0] =~ /((((?{$i++}))()(?{$i--})))*(??{("[^\w\W]")x(abs $i)})/ ? "BALANCED\n" : "NOT BALANCED\n";

  • Rob (unregistered) in reply to Rob

    And if you don't like arbitrary code execution then my previous statement about LBAs still holds but now you just use backreferences to handle a predetermined number of parenthesis (that's the linear-bounded part).

  • (cs) in reply to JAPH
    JAPH:
    anon:
    ... Have you seen the things? Absolutely unreadable, unmaintable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    ("Programmers" is in inverted commas since we all know Perl guys don't actually code, they just throw fistfuls of scrabble tiles on the floor and type in whatever characters show face up)

    don't be ridiculous, scrabble doesn't have punctuation
    They use the blanks as random punctuation.

  • David (unregistered) in reply to baron5038

    Finally someone brings in the CS language theory stuff. I expected page 1, but no...

  • baron5038 (unregistered) in reply to Rob
    Rob:
    #!/usr/bin/perl print $ARGV[0] =~ /((\((?{$i++}))*(\)(?{$i--}))*)*(??{("[^\w\W]")x(abs $i)})/ ? "BALANCED\n" : "NOT BALANCED\n";

    $ checkbalparen "()" BALANCED

    $ checkbalparen "())" NOT BALANCED

    $ checkbalparen "((((()))))" BALANCED

    $ checkbalparen "((((()))))" -- I ctrl-C'd this because it appeared to hang

    $ checkbalparen "((((())))" BALANCED -- uh, no it's not

    $ checkbalparen "(()" BALANCED -- doh

    $ checkbalparen "(()))" NOT BALANCED

    $ checkbalparen "((())))" NOT BALANCED -- hmmm, that took a noticeable amount of time to execute

    $ checkbalparen "(((()))))" NOT BALANCED -- hmmmmmmmmmmmmm, that took a few SECONDS

    $ checkbalparen ")(" BALANCED -- 'nuf said, I think

  • (cs) in reply to Rob
    Rob:
    #!/usr/bin/perl print $ARGV[0] =~ /((\((?{$i++}))*(\)(?{$i--}))*)*(??{("[^\w\W]")x(abs $i)})/ ? "BALANCED\n" : "NOT BALANCED\n";
    I'm gonna check this out when I get home (I don't have access to a Perl interpreter at work). I'm also going to have to brush up on my Perl to understand more about what's going on. Thank you for providing a sample and not just storming off in a huff like most people would.
  • the captcha text is xevious (unregistered) in reply to anon

    did someone say "sub-continent"

  • Stephen Touset (unregistered) in reply to DSTMan

    Who the fuck is going to use a regular expression in this case? This is why languages provide constructs like

    foo.downcase

    Something simple like tr/A-Z/a-z/ or an s///g-based solution is the worst idea to do. Besides being nowhere near as readable or intelligible, it's also wrong. This assumes only characters within the 128-bit ASCII character set, and fails on any character with uppercase and lowercase equivalents outside the A-Z range. Yeah, you aren't going to have that in HTML, but even getting in the habit of writing a regexp for this sort of thing is a bad idea.

    Regular expressions are great for many things. But just as you wouldn't use one for a serious attempt at a primality test ('1' * num !~ /^1?$|^(11+?)\1?$), you shouldn't use it for case changes or parsing structured data like XML, YAML, or SGML.

  • Stephen Touset (unregistered) in reply to Welbog
    Welbog:
    Provide me with an example of a regex matching a non-regular language, and you'll have won me over.

    The canonical example is

    /^(.*)\1$/

    which defines {ww | w in sigma*} which is neither regular, nor context-free. Regexps as implemented in current programming languages are not regular in the computer-science-theoretic sense, and have been extended to describe other classes of languages.

  • Stephen Touset (unregistered) in reply to Welbog
    Welbog:
    I want to see a regular expression that can match balanced parentheses before I will believe this.

    I can prove to you that a Turing machine can solve the graph isomorphism problem, which is NP-complete. I can't show you an example of it working, however, for two graphs with a million nodes apiece. Do you therefore disregard the proof?

  • Toby (unregistered) in reply to anon

    Just a couple of words

    RegExBuddy

    Regexbuddy.com

    See the light.

  • (cs) in reply to Stephen Touset
    Stephen Touset:
    The canonical example is
    /^(.*)\1$/

    which defines {ww | w in sigma*} which is neither regular, nor context-free. Regexps as implemented in current programming languages are not regular in the computer-science-theoretic sense, and have been extended to describe other classes of languages.

    Oh, wow. That's really neat. Thank you for that example.

    Stephen Touset:
    I can prove to you that a Turing machine can solve the graph isomorphism problem, which is NP-complete. I can't show you an example of it working, however, for two graphs with a million nodes apiece. Do you therefore disregard the proof?
    Turing machines ≠ DFAs. An algorithm exists and can be expressed in a high-level language for conventional execution; if you can prove that an equivalent Turing machine doesn't exist, then you've disproven the Church-Turing hypothesis. Wheeee
  • ChrisH (unregistered) in reply to teqman
    teqman:
    Regexps are fine for changing the case of tag names; it's not that hard to write a proper regexp to isolate elements of a well-formed HTML document.

    True, cross your fingers that it's well-formed.

  • mark (unregistered) in reply to Cyberwizzard
    Cyberwizzard:
    First!

    Omfg, thats not even funny... How can someone who obviously never heard of toLowerCase() (or its equivilents) write a HTML parser... I'd love to see what that thing does when I want to see its DOM tree... it does have a DOM tree right?.... I feel a headache coming up... sigh

    Screw toLowerCase(), just do s/<(.*?)>/\L$1\E/g. Easy easy.

    Captcha: Stinky. I am not!

  • zzo38 (unregistered) in reply to anon
    anon:
    Regexps are inappropriate for human exposure, period.

    Have you seen the things? Absolutely unreadable, unmaintable line noise. I've yet to meet anybody except Perl "programmers" who consider them defensible.

    No, that's wrong. I have used regexes in PHP, and I can understand it. It isn't unreadable or unmaintainable.

    $file_raw=' '.implode('',file($argv[1])).' ';
    preg_match_all('/(?<=\s)([^\[{\s]\S*|\[[^\]]*\]|\{[^\}]*\})(?=\s)/',$file_raw,$file_data);
    $file_data=$file_data[0];
    

    And perl programmers don't just throw scrabble tiles on the floor and write whatever lands face-up, it would be unlikely to do what is expected if you do that.

Leave a comment on “Just in Case”

Log In or post as a guest

Replying to comment #:

« Return to Article