Admin
I figured out how to make addition work in C.
// This solves A + B.
int add(int A, int B) {
    if (A == 1) {
        if (B == 0) return 1;
        if (B == 1) return 2;
        if (B == 2) return 3;
        if (B == 3) return 4;
        if (B == 4) return 5;
        if (B == 5) return 6;
        if (B == 6) return 7;
        if (B == 7) return 8;
        if (B == 8) return 9;
        ...
    }
}
Feel free to use the above code. I am releasing it under GPL as soon as it is approved.
Thank you.
Admin
If that's what you want to do, then don't catch the exception in that function in the first place.
Data Access Layer --> throws exception
Business Logic Layer --> no try/catch whatsoever
UI Layer --> catches exception and displays error message to user
If a function needs to do some cleaning up in case an exception tears through it, then use try/finally with no catch. If you want to do something with the exception itself but still rethrow it, e.g. to a UI layer, use 'throw' by itself to preserve the stack trace.
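To make that concrete, here is a rough Java sketch of the same layering (all class, method, and resource names here are made up for illustration). The bare-'throw' advice above is C#-flavoured; in Java the lower layers can simply declare the exception and never catch it, so the full stack trace reaches the UI untouched:

import java.sql.SQLException;

public class LayeredExceptionDemo {

    // Data access layer: throws, does not catch.
    static String loadCustomerName(int id) throws SQLException {
        throw new SQLException("connection refused"); // stand-in for a real query failure
    }

    // Business logic layer: no catch block at all; try/finally only for cleanup.
    static String customerGreeting(int id) throws SQLException {
        AutoCloseable resource = null; // stand-in for a connection, file handle, etc.
        try {
            return "Hello, " + loadCustomerName(id);
        } finally {
            // cleanup runs whether or not an exception tears through this method
            if (resource != null) {
                try { resource.close(); } catch (Exception ignored) {}
            }
        }
    }

    // UI layer: the only place the exception is caught and shown to the user.
    public static void main(String[] args) {
        try {
            System.out.println(customerGreeting(42));
        } catch (SQLException e) {
            System.err.println("Sorry, we couldn't load your data: " + e.getMessage());
            e.printStackTrace(); // the original stack trace is intact; nothing was swallowed
        }
    }
}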
Admin
In a slightly non-standard dialect of RegExp (in that capture groups are identified by {}, \s indicates any whitespace character, and angle brackets must be escaped by backslashes), I have tested this: <{[^\s>]+}>|<{[^\s>]+}[^>]+>
If you lowercase the capture group on each match, it works.
Even for code like:
Sure, it fails in a few remaining corner cases:
Of course, I don't expect that attributes with a ">" followed by a "<" occur very often in regular usage.
I also believe that (while entirely possible) extending a Regular Expression to catch valid HTML attributes and filter them out would be excessive.
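For anyone who wants to try the same idea in a more conventional regex dialect, here is an illustrative Java sketch that lowercases just the tag name and leaves attribute values alone. It is not the pattern above, only the same general approach, and it makes the same simplifying assumption that no ">" appears inside an attribute value:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowercaseTagNames {

    // Group 1: "<" plus optional "/", group 2: the tag name, group 3: the rest of the tag.
    private static final Pattern TAG =
            Pattern.compile("(</?)([A-Za-z][A-Za-z0-9]*)([^>]*>)");

    static String lowercaseTagNames(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + m.group(2).toLowerCase() + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Attribute values keep their case; only the tag name is lowered.
        System.out.println(lowercaseTagNames("<A HREF=\"Index.HTML\">Hi</A>"));
        // -> <a HREF="Index.HTML">Hi</a>
    }
}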
Admin
I for one laughed so hard I cried....thank you.
Admin
Greedy: *
Not greedy: *?
Greed is definitely not the problem here.
Admin
ha.. yeah, admittedly, I didn't read the whole comment (missed the scrabble part). But reading back over it, the sarcasm is fairly obvious.
I hadn't had my morning coffee yet, give me a break :)
Admin
Do the Html tags get a happy ending after their massage?
Admin
Regex handlers are (finite) state machines.
An HTML parser would (probably, I think) be a push-down automaton (effectively a finite state machine with a stack). But the HTML would have to be perfect. Any program which has to parse HTML robustly would be a horrible mess of special cases and ugliness.
Admin
You were kidding, right? There are about 40 posts preceding yours on that exact topic.
Admin
Okay.. so it's not so much the catching of the Exception as the method where it was done. Gotcha. That makes sense. Thanks!
Admin
Admin
Spoken in response to a request on how to use RegEx'es for HTML.
Admin
[quote user="skington"]Still, this is a "bug" and one that you cannot solve simply with regexps. Why? Because regexps are used to work on flat strings and not on tree organised data.
Try to write a regexp to match the content of a tag inside a
With the appropriate negative look-ahead assertions you can probably pull it off. It would be silly to do so if you had a decent XML-parsing library to hand, though, I agree.[/quote]
You could do exactly that with Perl and that happens to be the number 1 reason Perl should die a fiery death. Inclusion of executable code WITHIN a regexp should be illegal.
I've written XML and HTML parsers myself using regexps to match tags with end tags, locate name=value pairs and quote-enclosed strings which the PARSER then understands and builds its tree with. Regexp isn't the parser, just a tool used by a parser. And they're extremely efficient.
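That division of labour is easy to sketch. The following illustrative Java snippet (patterns simplified, nowhere near production-grade) uses one regex purely as a lexer to find tags and another to pull name="value" pairs out of each tag; building a tree from those tokens is left to whatever parser sits on top:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagLexer {

    // One pattern finds whole tags; another pulls name="value" pairs out of a tag.
    private static final Pattern TAG  = Pattern.compile("<[^>]+>");
    private static final Pattern ATTR = Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");

    public static void main(String[] args) {
        String html = "<a href=\"index.html\" title=\"Home\">home</a><br>";

        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            System.out.println("tag token: " + tag.group());
            // A real parser would take these tokens and build its tree;
            // the regexes only do the lexing.
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                System.out.println("  attribute " + attr.group(1) + " = " + attr.group(2));
            }
        }
    }
}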
Admin
A regular expression IS a state machine. Regular languages, NFAs, DFAs, and RegExps are all 1-to-1 convertible into one another.
Admin
I've heard that quote before. Except it wasn't with regular expressions, it was with XML.
Anyone who thinks using a regexp is a problem is probably using it for something it's not intended for.
Admin
If by that you mean you can do anything with a regexp - well yes. But you can equally do anything with brainfuck. Just because you can do a certain job in any given language doesn't mean you should.
Admin
No. Any proper xml parser would barf at that one, too, for precisely that reason: > is a reserved character and must be escaped in the string. So the fact that the regexp processor would fail is not limited solely to regexps. Doing it via a character by character string match would also fail unless your parser craps on the standards for parsing of HTML.
Admin
Also, there's no reason an XHTML 1.1-compliant parser could not use regular expressions and a stack to push and pop the most recent tags to/from.
It would require that the document be perfect, but then again XHTML 1.1 requires the document to be perfect already, so what's the problem?
It would actually probably be far more efficient and is probably (by my reckoning) one of the reasons behind that part of the standard.
Admin
You're missing the point. A regular expression literally is a state machine. They were designed to describe regular languages. In fact every Regex implementation I've ever seen compiles it down to a state machine, so the statement that it's better to use a state machine than a regex for parsing HTML is purely nonsensical.
Admin
EXACTLY. Thank you...
If XML isn't a regular language, I don't know what is... Just because people write crappy HTML doesn't mean the parser is wrong. There is a reason there are HTML designer tools - it's to eliminate user error. If someone writes a bad HTML document because they did it manually and made a typo, it's their own damn fault. The programmer of the parser should not be responsible for errors made by an arrogant HTML "coder" who feels he's too good for WYSIWYG.
If you write good HTML in notepad, then more power to you. Just don't complain when a parser can't read a document you created because of a misplaced quote, tag, or brace.
It's a perfect example of knowing enough to be dangerous. Use the tool - it'll save you time, plus they all let you screw with the raw html anyway...
Admin
Bloody hell, people. I'm no regex expert, but even I know that you DO NOT use regexs to parse HTML. Every informed opinion I've read on the subject has been exceedingly clear on this point.
Admin
I'm not saying using REGEX with HTML is good or bad, and it's not really something I would do (although, I do work with a specific case that I inherited where it's looking for a VERY specific line in an HTML document, not so much for parsing tags though). But this REGEX (I'm fairly new to using them) seems to get the first part of the tag into $1 and then the first quoted string into $2 and the rest in $3 (sorry, worked it out in Perl)... could probably be used in a higher-level language to create something. Maybe a recursive function to keep processing $3 until it's empty? I dunno. Just a starting point.
/<([a-z0-9:=;/! ]+)"([^"]+)"([^>]+)>/i
Admin
Never mind, that was pretty sloppy. Oh well, that's what I get for trying to rush through in a programming language I only use rarely while trying to implement a RegEx, which I don't use that often either.
Admin
The simplest non-regular language that we all know and love is the language of all palindromes. It is not regular because regular expressions cannot (at least in the truest sense) remember the past. Once a part of a string is read by a regular expression, it is lost and cannot be compared to anything later in the string. Since a palindrome requires the beginning of a string to be compared to the end, it cannot be regular. (Yes, I know this isn't a rigorous proof, but I don't want to go through the pumping lemma for this. A high-level explanation should be enough).
Now think about well-formed XML. Each element has to be opened and then closed: <element>[inner stuff]</element>. Now think about this as a palindrome: <element> has to be read and then later compared to </element> for it to be valid. The state machine has to know that the first element was <element> in order for </element> to be valid. But the best a regular expression can do is know that there was a well-formed element of indeterminate name. And so, by this hand-wavey (and completely correct) argument, XML is not a regular language. It's context-free, like palindromes. You need a stack alongside a simple regular lexer in order to validate it.
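If it helps, here is a small illustrative Java sketch of that "regular lexer plus a stack" combination: the regex only tokenizes open and close tags, and the stack supplies the memory a pure regular expression lacks. It assumes toy input with no attributes, comments, or self-closing tags:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestingCheck {

    // The regex is only a lexer: it finds open/close tags and captures the name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\s*>");

    static boolean isWellNested(String xml) {
        Deque<String> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (!closing) {
                stack.push(name);                       // remember what must be closed later
            } else if (stack.isEmpty() || !stack.pop().equals(name)) {
                return false;                           // close tag with no matching open tag
            }
        }
        return stack.isEmpty();                         // everything opened was closed
    }

    public static void main(String[] args) {
        System.out.println(isWellNested("<a><b>text</b></a>")); // true
        System.out.println(isWellNested("<a><b>text</a></b>")); // false
    }
}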
Admin
Wasn't "Some people, when..." originally sed, specifically, not just regular expressions?
Still true, of course.
Admin
Everybody is so excited about using regular expressions for this...
Just iterate through the html text (the char []) with two booleans - inTag and inQuote. When you encounter '<' set inTag. If you are inTag and encounter a " or ', negate inQuote. When you encounter a '>' and inQuote==false, unset inTag. Otherwise lowercase the character if inTag && !inQuote. It's O(n) and you can do other things like build the DOM tree whilst iterating (with a little more work of course, but it is the logical thing to do at this stage).
Or you could have an enum of parser states, because at some point you know you'll have umpteen booleans and that's tatty. The above doesn't handle unquoted attributes, for example "< a href= images/onomatopoeia.png border =0>", where you need to have an inNonquotedValue when you hit a '=' until you hit a space, unless the space is after the = but before the value...
It's more code, but it will probably end up more readable than a regular expression that handles all the retarded stuff people put in HTML, and you can do other work other than merely lowercasing tags and attributes.
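A rough Java version of that two-boolean walk, just to make it concrete (like the description above, it deliberately ignores comments, scripts, and unquoted attribute values):

public class InTagLowercaser {

    // Lowercase characters that are inside a tag but outside quotes; leave text content alone.
    static String lowercaseTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        boolean inQuote = false;
        for (char c : html.toCharArray()) {
            if (!inTag && c == '<') {
                inTag = true;                  // entering a tag
            } else if (inTag && (c == '"' || c == '\'')) {
                inQuote = !inQuote;            // toggle quoted-attribute state
            } else if (inTag && !inQuote && c == '>') {
                inTag = false;                 // leaving the tag
            }
            out.append(inTag && !inQuote ? Character.toLowerCase(c) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowercaseTags("<A HREF=\"Page.HTML\">KEEP THIS CASE</A>"));
        // -> <a href="Page.HTML">KEEP THIS CASE</a>
    }
}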
Admin
Interestingly enough, the code provided in the OP will not translate "HTMl", "HTmL", "HtML" or "hTML" tags to lowercase, either.
Admin
As a fun fact, apparently escaping of < even inside attribute values is mandatory for exactly that reason. When designing XML, one of the minor design goals was to allow superficial hackish manipulation from the (quote) "desperate perl hacker".
Admin
Unless you are catching an exception to throw a more specific exception or do "something" with it, there is no reason to catch and re-throw, just let it bubble up.
Admin
You fail at regexes.
In Perl use \L in the replacement to lowercase all following chars, \l to lowercase only the next char.
http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES
Admin
Heh, that's classic :). Though I have to admit I find regexes to be occasionally useful for fairly simple things. It's when you start getting to "the real power of regex" that it quickly turns to unmaintainable crap.
A favorite quote of mine (although I wish I could remember where I saw it): "Perl - The only language that looks the same before and after encryption."
Admin
It's easy to match distinct open & close tags. Match the open tag, consume anything that is not the close tag, and match the close tag i.e.: m/<[^>]*>/; Work on this tag string as needed.
A common (PERL) WTF is using the non-greedy test (.*?) to find the nearest close tag: m/<.*?>/; These expressions can backtrack forever! If I get one person to stop using non-greedy tests, I'll have much happier, and faster, servers.
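For anyone following along, here is a small illustrative Java comparison: a greedy <.*> overshoots to the last ">", the lazy <.*?> finds each tag but only by repeatedly growing the match and retrying, and the negated character class <[^>]*> gets the same tags without that retrying:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {

    static void show(String label, String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        System.out.print(label + ":");
        while (m.find()) {
            System.out.print(" [" + m.group() + "]");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";

        // Greedy .* overshoots: one match from the first '<' to the last '>'.
        show("<.*>   ", "<.*>", html);

        // Lazy .*? finds each tag, but only by growing the match one step at a time.
        show("<.*?>  ", "<.*?>", html);

        // A negated character class gets the same tags without that retrying.
        show("<[^>]*>", "<[^>]*>", html);
    }
}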
Admin
First time poster, but I'm pretty sure that only works with non-greedy regexes. You really need something like <[^>]+>.
Sorry if this has been pointed out, but didn't read to the end.
Admin
I'm an idiot. See previous two posts.
Admin
Admin
I'd quote W3C chapter+verse here, but I'm not enough of a pedant to remember such things.
Admin
It's not that straightforward. A simple toLowerCase() will mess up the entire text of the document too - remember that at this point you still haven't actually isolated the tags from the content.
I can't think of any way to solve this without using regex.
Well, apart from the solution that was already implemented. If you can call that a solution...
Admin
OK, I have to say, the real WTF is how few people in this thread seem to have any concept of what a regular expression actually is and how dynamically useful they are.
Regular expressions haven't been REGULAR since the early 70s!!
Technically, they are linear bounded automata that can match context-sensitive grammars, and if you were slick enough and had enough time (and used Perl and the /x modifier to save your sanity) you could build an entire XML verifier. (You'd save some typing by pre-computing certain portions too.) Note that I said verifier because, no, you wouldn't necessarily be able to do a lot more with it than confirm that the XML was well formed, but you COULD do it.
And just in case you're really that stupid: NO, you would never do that, because using regexes in this way is NP-hard and wouldn't be very efficient, but it would be possible. (This is probably why so few human languages exist that cannot be expressed using linguistically based CFGs; most people just can't handle more work than a simple stack.)
AHH...my soapbox just collapsed...
Admin
Your example isn't valid XML
Admin
Try your suggestion on:
You'll match "ust", which shouldn't be what you intend to match.
Admin
Yes and No.
It would be complicated to implement a complete HTML parser solely in regexes. But regex is useful as a tool in implementing a part of the parser: the lexer. Ever heard of lex and yacc? (There are free clones called GNU flex and bison.)
People do use lex and yacc to build compilers for PROGRAMMING LANGUAGES. Using these tools to build an HTML parser would be trivial for compiler writers. And guess what! "lex" code uses regular expressions! ("yacc" code uses BNF-like grammar expressions.)
I don't think so. Well-written regexes are much more readable and maintainable than hundreds of lines of VB code that does the same thing.
Yeah. If I need to write an HTML parser, I'll structure it that way: lexer, parser. But how would I code the lexer? I'd use regexes. Just like what people do using "lex".
In my lexer, I'd scan a whole tag at a time, exploiting regexes as much as possible, but not digging into the kind of complexity that would forgo readability (to a regex expert) and maintainability. I'd also scan a whole fragment of consecutive, untagged text without entity references at a time. So, I'd end up with token boundaries akin to SAX events. :)
Admin
If you can't handle the exception you should let it go. By catching it and throwing a new exception you eliminate the stack trace and some of the other information contained in the original exception. Unless that's your intent for whatever reason, you should just let it pass to the next layer until it reaches somewhere that can handle it. Your application should have a top level exception handler that handles fatal exceptions which can't be caught at any other level and gracefully terminates the application. For some errors that's the best you can do.
Admin
</?(\w+)((?:\s+\w+="[^"]*")*)\s*/?>
as the following code snippet demonstrates:
Admin
Everybody stand back. I know regular expressions.
Admin
Admin
Did you read the link you posted? It forbids "<", not ">".
Admin
No, that's not what he means. You can't do "anything" with regexps, since as RON pointed out they are equivalent to nondeterministic finite automata. An example of what you cannot do with any NFA is recognize a context-free language that is not a regular language. XML is not a regular language.
I'm glad to see several other posters here who have a grasp on the theory. The rest of you need to go back to school. :-)
And Rob: I'm aware that "regular expressions" as referred to by programmers are very different from the "regular expressions" referred to by computer scientists. I just happen to think they shouldn't have the nerve to call them "regular" anymore, since they're not. :-P
Admin
Why are regexps unsuitable for xml parsing? They work.
Admin
Do Perl regexps go back to front or something?
Admin
Captcha: craaazy (ah, it's read Henry Spencer's RE engine code!)