- Feature Articles
- CodeSOD
- Error'd
- Forums
-
Other Articles
- Random Article
- Other Series
- Alex's Soapbox
- Announcements
- Best of…
- Best of Email
- Best of the Sidebar
- Bring Your Own Code
- Coded Smorgasbord
- Mandatory Fun Day
- Off Topic
- Representative Line
- News Roundup
- Editor's Soapbox
- Software on the Rocks
- Souvenir Potpourri
- Sponsor Post
- Tales from the Interview
- The Daily WTF: Live
- Virtudyne
Admin
Actually, regexp by themselves provably can't handle things like parenthesis matching. Not even the best programmer can program a regex which will match nested parenthesis (i.e.: "(())()"). Though I think you're correct in that regex should be able to parse an html (or xml) tag (if the programmer is good enough).
Parsing the entire [ht,x]ml tree can't possibly be done with regex, however, because of the possibility of nested elements.
Admin
I'm going to be a little bit informative instead of joining in the flame war. If you have a dodgy HTML doc that you want to parse and you are using Java, then TagSoup could be just what you need. It behaves exactly like a standard SAX parser hiding all the nasty hacks. For general html cleanup of course Tidy/JTidy is great.
Admin
You're wrong. As I stated before, the regular expressions as defined by Kleene (as in Kleene *) in the 1950s matched regular languages but as soon as computers came around, and particularly with the inclusion of backreferences in sed, awk, and perl, regexes became much more powerful.
Larry Wall, the inventor of perl recently decided to only call them regexes to attempt to avoid this confusion.
This quote comes from the perl documentation with "captured substrings" meaning backreferences like \1 in awk, $1 in perl:
"There is no limit to the number of captured substrings that you may use."
So in perl and any other tool that uses its engine (python, php, grep -P, pcregrep, etc), regexes would be Turing machines if you had infinite memory and Linear Bounded Automata. That means that YES they can match nested parenthesis and YES they can match nearly any arbitrary (decidable) language in the computational sense of the word (look it up)
However, because they only operate on variables that contain exactly one value (cells in the computability terminology), and that are explicitly write-once, read forever, it would be very tedious to write regexes to parse many of these arbitrary languages.
But not impossible, dammit. And in the meantime, any text processing you do in any language without regexes, a perl programmer can do faster (in terms of coding time), and a good perl programmer can do faster and without any hideously unreadable or inefficient regexen.
So don't talk about "provably" impossible things when you don't know what you're talking about.
Admin
I want to see a regular expression that can match balanced parentheses before I will believe this. I don't want to see a Perl script that can do it; I want to see a regular expression (or, "regex" I guess, to use your terminilogy).
Provide me with an example of a regex matching a non-regular language, and you'll have won me over.
Admin
That's not what Rob's claiming; he's claiming that what he's calling a "regex" isn't a regular expression but a Turing Machine. (I'd be tempted to call it an n-PDA myself, since backreferences can be accessed out of order, but since n-PDAs and TMs are computationally equivalent that's just a quibble.)
I don't think Rob phrased his argument well, but it doesn't contain any incorrect claims about the computing power of regular expressions or "regexes" - just a claim that the latter term refers to a different class of machine than the former.
I take no stand on whether this terminological claim, or the ascription of it to Larry Wall, is correct. I do the dilution of the term "regex" (or "regexp", etc) is unfortunate, but here we are. Though I'll note that back before Larry decided to go all Dr Frankenstein and was still writing things like rn and patch, we used to call these things "EREs", for "extended regular expressions", which at least suggested that we weren't talking about Kleene's regular expressions.
-- Michael Wojcik
Admin
Companies think they can save $$$ by hiring cheap Indian and Chinese H1-Bs. They get this.
Admin
#!/usr/bin/perl print $ARGV[0] =~ /((((?{$i++}))()(?{$i--})))*(??{("[^\w\W]")x(abs $i)})/ ? "BALANCED\n" : "NOT BALANCED\n";
Admin
And if you don't like arbitrary code execution then my previous statement about LBAs still holds but now you just use backreferences to handle a predetermined number of parenthesis (that's the linear-bounded part).
Admin
Admin
Finally someone brings in the CS language theory stuff. I expected page 1, but no...
Admin
$ checkbalparen "()" BALANCED
$ checkbalparen "())" NOT BALANCED
$ checkbalparen "((((()))))" BALANCED
$ checkbalparen "((((()))))" -- I ctrl-C'd this because it appeared to hang
$ checkbalparen "((((())))" BALANCED -- uh, no it's not
$ checkbalparen "(()" BALANCED -- doh
$ checkbalparen "(()))" NOT BALANCED
$ checkbalparen "((())))" NOT BALANCED -- hmmm, that took a noticeable amount of time to execute
$ checkbalparen "(((()))))" NOT BALANCED -- hmmmmmmmmmmmmm, that took a few SECONDS
$ checkbalparen ")(" BALANCED -- 'nuf said, I think
Admin
Admin
did someone say "sub-continent"
Admin
Who the fuck is going to use a regular expression in this case? This is why languages provide constructs like
foo.downcase
Something simple like tr/A-Z/a-z/ or an s///g-based solution is the worst idea to do. Besides being nowhere near as readable or intelligible, it's also wrong. This assumes only characters within the 128-bit ASCII character set, and fails on any character with uppercase and lowercase equivalents outside the A-Z range. Yeah, you aren't going to have that in HTML, but even getting in the habit of writing a regexp for this sort of thing is a bad idea.
Regular expressions are great for many things. But just as you wouldn't use one for a serious attempt at a primality test ('1' * num !~ /^1?$|^(11+?)\1?$), you shouldn't use it for case changes or parsing structured data like XML, YAML, or SGML.
Admin
The canonical example is
which defines {ww | w in sigma*} which is neither regular, nor context-free. Regexps as implemented in current programming languages are not regular in the computer-science-theoretic sense, and have been extended to describe other classes of languages.
Admin
I can prove to you that a Turing machine can solve the graph isomorphism problem, which is NP-complete. I can't show you an example of it working, however, for two graphs with a million nodes apiece. Do you therefore disregard the proof?
Admin
Just a couple of words
RegExBuddy
Regexbuddy.com
See the light.
Admin
Admin
True, cross your fingers that it's well-formed.
Admin
Screw toLowerCase(), just do s/<(.*?)>/\L$1\E/g. Easy easy.
Captcha: Stinky. I am not!
Admin
No, that's wrong. I have used regexes in PHP, and I can understand it. It isn't unreadable or unmaintainable.
And perl programmers don't just throw scrabble tiles on the floor and write whatever lands face-up, it would be unlikely to do what is expected if you do that.