Admin
They had to pass over twice to remove attributes, so maybe they feel they gotta do this twice too?
Admin
Admin
Most vehicular wheels aren't perfectly circular... they're a little bit flat on the bottom.
Admin
Some programs can't even write valid HTML, so someone has to fix the errors of the others. A good example is the mess that M$ Word creates when you save a document as HTML.
horrible stuff. indeed.
Admin
I like how they'll remove style="stuff" but not style = "stuff" - despite all of their attempts at stripping multiple trailing spaces from p and div tags.
Admin
Here's part of my solution for "cleaning" HTML: https://github.com/rowan-lewis/wiki/blob/master/wiki/libs/html.php
So yeah, it uses XPath, Lib Tidy, regular expressions and then XSLT to output only the desired markup.
Admin
Ouch. Just ouch.
Admin
Especially with IgnoreCase set ... after all, those uppercase blanks are there for a purpose.
Admin
It's sad when someone makes a joke about doing something and then you remember that you did do that once in the past.
Admin
The center cannot hold!
Admin
Admin
Yes, if they'd done that they could have completed the whole project in six weeks...
Admin
All the nice RegEx's in the beginning... (btw, running one of them twice only removes two attributes from the tags, so if there are more...), and then:
html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);
html = html.Replace("", "");
html = html.Replace("", "");
// snip repeats
html = html.Replace("\n", "");
html = html.Replace("\r", "");
html = html.Replace("", "");
html = html.Replace("", "");
// some more snipping
html = html.Replace("  ", "");
html = html.Replace("
", "");
html = html.Replace("
", "");
while (true)
{
    if (!(html.Contains(" >") || html.Contains(" >"))) break;
    html = html.Replace(" >", ">");
    html = html.Replace("< ", "<");
}
CAPTCHA: conventio ... it is truly against convention what the coder(s) is/are trying to do here.
Admin
He even stuffed up the if statement at the end: he has an OR on two identical conditions.
Admin
Oh, you're good.
Admin
Provided your purpose is to generate a plain text equivalent, not to prevent malicious users injecting rogue HTML tags:
&lt;script&gt;exploit.run()&lt;/script&gt;
Run the above through SAX and you'll probably get:
<script>exploit.run()</script>
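To illustrate the point above: a SAX pass that forwards only character events also decodes entity references, so entity-escaped markup comes back out as live markup. A minimal Python sketch using the standard `xml.sax` module, assuming the input was entity-escaped; the `TextOnly` handler, the `strip_tags` helper, and the `<p>` wrapper are names made up for this example, not anything from the article.

```python
import xml.sax


class TextOnly(xml.sax.ContentHandler):
    """Collect character events and ignore every element event."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser has already decoded &lt; and &gt; by the time we get here.
        self.chunks.append(content)


def strip_tags(xml_text):
    """Return only the character data of a well-formed XML string."""
    handler = TextOnly()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.chunks)


# Hypothetical input: the script tag is entity-escaped inside a paragraph.
escaped = "<p>&lt;script&gt;exploit.run()&lt;/script&gt;</p>"
print(strip_tags(escaped))
```

The printed text contains a literal script tag again, so it would have to be re-escaped before being written into a page.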
Admin
And in case anyone cares. The correct solution is to download jsoup and then do a
Admin
Gotta love the double checking nature even at the most elementary level:
(html.Contains(" >") || html.Contains(" >")
I'm sure they thought of the possibility of the value of the HTML string changing on another thread.
Admin
Admin
Think you meant hovel rather than hobble ;)
Admin
(html.Contains(" >") || html.Contains("< ")
I mean, who hasn't done something like this?
Admin
Ha, I see what you did there. Clever.
Refactored to catch all scenarios:
if (!(html.Contains(" >") || html.Contains(" >"))) && if ((html.Contains(" >") == false || (html.Contains(" >") == true))
Admin
I'm pretty sure they wouldn't have to do all this if they were using BobX..
Admin
So why were half of yesterday's comments truncated?
Admin
Admin
good old copy-paste: if (!(html.Contains(" >") || html.Contains(" >")))
Admin
Glad Alex took care of Mr V1Agra. Hopefully this comment will post.
Admin
My Mom wrote something on Office 2009 or so, saved it as .docx (it was the default), and then took it to my uncle's place where he had some old XP machine running WordPerfect. Needless to say, nothing would open the file, and my uncle didn't want me installing anything on that machine.
So I opened the file in Notepad, saw that the first two letters were "PK", changed the file's extension to .zip, opened that, and found the actual document file (as opposed to the six other files that held settings and whatever else).
Egads.
Bear in mind that the original file just had default font, style, page, and other settings. It should have been pretty much just a text file. But no.
Every single "misspelled" word (i.e. every proper name) was wrapped in a five-tag-deep stack: one tag to mark it as a misspelled word and the others to reapply Arial 13 Regular single-spaced left-justified or whatever. Every single "grammar error" got the same treatment. Bad grammar and spelling? Let's just say I got to see how far that rabbit hole went.
Worse yet, WordPerfect didn't have a search-and-replace that supported wildcards, only exact matches. (Remember, not allowed to install software, so WP, Notepad, and Wordpad were all that was available.) Mercifully, it had an "extend selection to search term" checkbox. It still took 15 minutes just to strip off all the tags and figure out which one (I think it was "w:p") marked off paragraphs.
It took Microsoft over 6000 pages to describe this file format as a standard. Ugh.
Admin
That looks like a bug; it should probably read:
if (!(html.Contains(" >") || html.Contains("< ")))
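A sketch of what the loop seems to be aiming for, written in Python rather than the article's C#, and assuming the check is meant to mirror the two replacements (`" >"` and `"< "`). The function names are made up for this example. The loop is needed because a single non-overlapping replace turns two spaces before `>` into one, not zero; the regex variant does the same job in one pass.

```python
import re


def trim_tag_spaces(html):
    """Loop version: repeat until neither pattern is left."""
    while " >" in html or "< " in html:
        html = html.replace(" >", ">")
        html = html.replace("< ", "<")
    return html


def trim_tag_spaces_regex(html):
    """Single-pass version: collapse any run of spaces next to a bracket."""
    return re.sub(r"< +", "<", re.sub(r" +>", ">", html))


print(trim_tag_spaces("<p  >hi<  /p >"))
```

Both versions also rewrite ordinary text such as `a > b`, which is one more reason string replacement is a poor fit for this job.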
Admin
"The fastest most dependable way to clean html from a string/file is to use SAX and only forward the characters events."
Until your SAX processor runs into a lone BR tag and bombs because it isn't correct XML.
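The failure mode is easy to reproduce. A small Python sketch, assuming the standard `xml.sax` module as the SAX processor and `html.parser` as a tolerant alternative; `sax_accepts`, `TextCollector`, and `html_text` are hypothetical helper names.

```python
import xml.sax
from html.parser import HTMLParser


def sax_accepts(text):
    """Return True if a SAX parser gets through the text without an error."""
    try:
        xml.sax.parseString(text.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


class TextCollector(HTMLParser):
    """Tolerant HTML parser that keeps only the text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(text):
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


snippet = "<p>one<br>two</p>"   # lone BR: fine as HTML, not well-formed XML
print(sax_accepts(snippet), html_text(snippet))
```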
Admin
Two pages of comments and nobody has pointed out yet that the RWTF is to use regex on SGML derivatives. WTF
HTML and XML are not context free and cannot be properly parsed by a regular expression.
For example, this is valid HTML:
<tag attrib="abc>de">
or this is valid XML:
<hello>>and now>>>>>></hello>
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
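A quick Python check of the first example, comparing a typical tag-stripping regex with the standard library's `html.parser`. The `Recorder` class and the closing `</tag>` plus text content are additions made for this sketch.

```python
import re
from html.parser import HTMLParser

sample = '<tag attrib="abc>de">text</tag>'

# Naive approach: treat everything between "<" and the next ">" as a tag.
naive = re.sub(r"<[^>]*>", "", sample)


class Recorder(HTMLParser):
    """Record the attributes and text that a real tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)


recorder = Recorder()
recorder.feed(sample)
recorder.close()

print(repr(naive))
print(recorder.attrs, "".join(recorder.text))
```

The regex stops at the `>` inside the quoted attribute and leaves half of the tag behind, while the parser reports the attribute value intact.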
Admin
I stop and stare at the very first regex:
[^>]*?
This allows zero or more characters (that are not an end bracket), and makes it optional too just in case nothing isn't there either. I think.
nulla: nienta, nada
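For what it's worth, in .NET and most other regex flavours a `?` directly after `*` does not mean "optional"; it makes the quantifier lazy (match as little as possible). A small Python sketch, since Python's `re` treats these constructs the same way; the sample string is made up.

```python
import re

text = "<a><b>"

greedy = re.findall(r"<.*>", text)           # grabs as much as possible
lazy = re.findall(r"<.*?>", text)            # stops at the first ">"
negated_greedy = re.findall(r"<[^>]*>", text)
negated_lazy = re.findall(r"<[^>]*?>", text)

print(greedy, lazy, negated_greedy, negated_lazy)
```

With a negated class like `[^>]` followed by `>`, the lazy and greedy forms find the same matches, so the `?` changes nothing there.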
Admin
Apart from the obvious OMGs, what's with this?
All those headbanging regexes above and yet no \s* here?!
Admin
Admin
Yeah! That's the tune to Funky Town!
Admin
Admin
I wrote an XHTML sanitizer for j0rb. Emphasis on the X (if that wasn't clear). I wasn't sure how slow it would be, but nonetheless, I load the document with a parser (requiring well-formed XML), and then walk every node in the tree. It supports two levels of "bad": remove and abort. Elements that are considered harmless, but undesirable, are replaced by the inner text. Attributes that are considered harmless, but undesirable, are removed entirely. Elements and attributes that are considered potentially malicious (<script>, onanything attributes, etc.) just log the occurrence, throw an exception, and refuse the data until the user fixes it. Much to my surprise, it actually seems to perform pretty decently. I'm not saying that it's the right way, but it's certainly a lot more right than this. :P
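A rough Python sketch of that two-level idea (unwrap the harmless, abort on the dangerous), using the standard `xml.etree.ElementTree`. Everything here is an assumption made for illustration: the whitelists, the `UnsafeMarkup` exception, the function names, and the lack of namespace handling. It is not the commenter's code and not a complete sanitizer (it does not inspect URL schemes in `href`, for example).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

ALLOWED_TAGS = {"p", "b", "i", "a"}      # hypothetical whitelist
ALLOWED_ATTRS = {"href", "title"}        # hypothetical whitelist
FORBIDDEN_TAGS = {"script", "iframe", "object"}


class UnsafeMarkup(ValueError):
    """Raised when the input should be refused rather than cleaned."""


def _render(node):
    # "Abort" level: refuse the whole document.
    if node.tag in FORBIDDEN_TAGS:
        raise UnsafeMarkup(f"forbidden element <{node.tag}>")
    for name in node.attrib:
        if name.lower().startswith("on"):
            raise UnsafeMarkup(f"event handler attribute {name!r}")

    # Render the children first; each child's tail text belongs to this node.
    inner = escape(node.text or "")
    for child in node:
        inner += _render(child) + escape(child.tail or "")

    # "Remove" level: unknown elements are replaced by their content.
    if node.tag not in ALLOWED_TAGS:
        return inner
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in node.attrib.items()
        if name in ALLOWED_ATTRS
    )
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def sanitize(xhtml):
    """Parse well-formed markup (ET.ParseError otherwise) and clean it."""
    return _render(ET.fromstring(xhtml))


print(sanitize('<p style="x">Hi <font color="red">there</font> <b>you</b></p>'))
```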
Admin
Admin
A valid XML document must be well-formed and also needs to contain a DOCTYPE declaration and must conform to it precisely.
If you wanted the text ">and now>>>>>>" to appear in the <hello> element of your well-formed XML document then you would need to escape the '>' characters:
Or put them in a CDATA section:
That said, you still can't reliably parse XML (nor any SGML dialect that I am aware of) with a regular expression. It's simply not possible.
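The two options mentioned above (escaping and a CDATA section) can be checked with a few lines of Python. The variable names are made up, and the documents are built here for illustration rather than copied from the comment.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = ">and now>>>>>>"

# Option 1: replace each ">" with the &gt; entity reference.
escaped_doc = f"<hello>{escape(text)}</hello>"

# Option 2: wrap the raw text in a CDATA section.
cdata_doc = f"<hello><![CDATA[{text}]]></hello>"

print(escaped_doc)
print(cdata_doc)
print(ET.fromstring(escaped_doc).text == ET.fromstring(cdata_doc).text)
```

Both documents parse back to the same character data.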
Admin
FTFY.
Admin
Admin
Bzzzt, wrong. < is required to be replaced by a character entity; > is not. Don't believe me? Look for yourself:
http://www.w3.org/TR/REC-xml/#dt-chardata
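This is easy to confirm with any conforming parser. A Python sketch using `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper name. Per the linked section of the spec, the one place a `>` in content must be escaped is in the sequence `]]>` when it does not close a CDATA section.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<hello>>and now>>>>>></hello>"))  # bare ">" in content
print(is_well_formed("<hello>a < b</hello>"))            # bare "<" in content
```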
Admin
Signed, The Real Nagesh
Admin
It's definitely not possible. Regular expressions can parse regular grammars. XML is not a regular grammar. It's not even context-free, as is often claimed. You can use the pumping lemma to prove both of these claims.
You can use a "completion" of regular expression parsing to parse context-free grammars, but it is nasty. (To complete a category, you embed it in a larger category that contains all of the first category's limits) Basically, you have to use multi-pass parsing to handle arbitrary nesting by splitting on the "next" nested expression at the same level in the parse tree. It will be slow. If you're going to do this, you might as well be using recursion and/or a good parser generator. All the hard stuff will be hidden away in the parser generator's bowels or the one "tricky" recursive definition which captures the normal form for the computation. Recursion is easy anyway.
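To make the "recursion is easy" point concrete, here is a toy recursive-descent parser in Python. It assumes a made-up mini-language of bare lowercase tags with no attributes or text, so it is only a sketch of the technique: a regex does the tokenizing, and recursion handles the nesting that a regex cannot track.

```python
import re

TOKEN = re.compile(r"<(/?)([a-z]+)>")


def parse(text):
    """Parse toy markup such as <a><b></b></a> into (name, children) tuples."""
    tokens = [(m.group(1) == "/", m.group(2)) for m in TOKEN.finditer(text)]
    if not tokens:
        raise ValueError("no tags found")
    pos = 0

    def element():
        nonlocal pos
        closing, name = tokens[pos]
        if closing:
            raise ValueError(f"unexpected </{name}>")
        pos += 1
        children = []
        # Every opening tag seen before our closing tag starts a child.
        while pos < len(tokens) and not tokens[pos][0]:
            children.append(element())
        if pos >= len(tokens) or tokens[pos][1] != name:
            raise ValueError(f"<{name}> is not closed properly")
        pos += 1
        return (name, children)

    tree = element()
    if pos != len(tokens):
        raise ValueError("unexpected content after the root element")
    return tree


print(parse("<a><b></b><c><d></d></c></a>"))
```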
Admin
No. The result is potentially the same; however, because test() is a method, not a variable, they are not the same thing. Consider:
static int myTestVar = 0;
public bool test() { return (myTestVar++ % 2 == 0); }
A more complicated example might be less predictable. What if "test()" simply runs the next test case in a file?
Although the two test() calls appear funny, there are plausible explanations for what is going on (that's not to say it's necessarily good practice).
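The same idea as a runnable Python sketch. `make_test` is a hypothetical factory that gives each caller its own counter; it stands in for the static field above.

```python
def make_test():
    """Return a test() function whose result alternates on every call."""
    calls = 0

    def test():
        nonlocal calls
        result = calls % 2 == 0
        calls += 1
        return result

    return test


single = make_test()
print([single() for _ in range(4)])          # alternates between calls

double = make_test()
print([double() or double() for _ in range(4)])
```

Calling it once per check gives alternating answers, while `double() or double()` comes out true every time, so the two calls are not redundant when the method has side effects.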
Admin
OOPS - let me retract that, I read your code rather than the code in the article.....My Bad!!
Admin
Plus there seemed to be a thread starting about V1agra
Admin
I thought of that, too. So I did this:
That got it into production the first time...