• Jim (unregistered) in reply to np
    np:
    if (!(html.Contains(" >") || html.Contains(" >")))

    I like it when code looks like 1 || 1. Maybe they thought the first html.Contains(" >") wasn't enough to find those pesky " >".

    They had to pass over twice to remove attributes, so maybe they feel they gotta do this twice too?

  • le stuff (unregistered) in reply to Power Troll
    Power Troll:
    Wasn't perl created for doing stuff like this?
    "for doing stuff" seems unnecessary...
  • (cs)

    Most vehicular wheels aren't perfectly circular... they're a little bit flat on the bottom.

  • Thats not WTF, its reality. (unregistered)

    Some programs can't even write valid HTML, so someone has to fix the errors of others. A good example is the mess that M$ Word creates when you save as HTML.

    Horrible stuff, indeed.

  • (cs)

    I like how they'll remove style="stuff" but not style = "stuff" - despite all of their attempts at stripping multiple trailing spaces from p and div tags.

  • buzzomatic (unregistered)

    Here's part of my solution for "cleaning" HTML: https://github.com/rowan-lewis/wiki/blob/master/wiki/libs/html.php

    So yeah, it uses XPath, Lib Tidy, regular expressions and then XSLT to output only the desired markup.

  • Shirley (unregistered)

    Ouch. Just ouch.

  • F (unregistered) in reply to brunascle
    brunascle:
    html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);

    The hell? That won't clean HTML, it will break it.

    Especially with IgnoreCase set ... after all, those uppercase blanks are there for a purpose.

  • anobin (unregistered) in reply to XXXXX

    It's sad when someone makes a joke about doing something and then you remember that you did do that once in the past.

  • anon (unregistered)

    The center cannot hold!

  • (cs)
    if (!(html.Contains(" >") || html.Contains(" >"))) break; html = html.Replace(" >", ">"); html = html.Replace("< ", "<");
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
  • P (unregistered) in reply to XXXXX
    XXXXX:
    They didn't write their own reg-ex language. Much less their own reg-ex processor. At this rate they'll never develop their own in-house, proprietary mark-up language. Color me unimpressed.

    Yes, if they'd done that they could have completed the whole project in six weeks...

  • Blowfish (unregistered)

    All the nice RegExes in the beginning... (btw, running one of them twice only removes two attributes from the tags, so if there are more...), and then:

        html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);
        html = html.Replace("<p></p>", "");
        html = html.Replace("<div></div>", "");
        // snip repeats
        html = html.Replace("\n", "");
        html = html.Replace("\r", "");
        html = html.Replace("<br>", "");
        html = html.Replace("<br />", "");
        // some more snipping
        html = html.Replace("<p>&nbsp;</p>", "");
        html = html.Replace("<div>&nbsp;</div>", "");
        html = html.Replace("<p> &nbsp; </p>", "");
        while (true)
        {
            if (!(html.Contains(" >") || html.Contains(" >"))) break;
            html = html.Replace(" >", ">");
            html = html.Replace("< ", "<");
        }
    1. The first one has the potential to completely break the html code.
    2. So inserting
          Regex.Replace(html, @"<(div|p)\s*>(\s+|&nbsp;)\s*</\1>", "", RegexOptions.IgnoreCase)
      was too difficult? (Well, I assume it was a different coder...)
    3. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    4. So, removing line breaks for obfuscation?

    CAPTCHA: conventio ... it is truly against convention what the coder(s) is/are trying to do here.
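    Incidentally, the loop itself is not as redundant as it looks: a single Replace pass shortens each run of spaces before '>' by only one character, so several passes are needed to reach a fixed point. A sketch of that fixed-point behavior in Python (str.replace, like .NET's String.Replace, substitutes all non-overlapping occurrences in one pass):

```python
def strip_space_before_gt(html: str) -> str:
    """Mirror the article's while loop: keep collapsing ' >' to '>'
    until no occurrence remains (a fixed point)."""
    while " >" in html:
        # one pass replaces every non-overlapping " >", but a run of
        # N spaces before '>' still needs N passes to disappear
        html = html.replace(" >", ">")
    return html

cleaned = strip_space_before_gt("<p   >text</p >")
```

    With three spaces before the '>', three passes are needed, which is presumably why the original author wrapped the Replace calls in a loop at all.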

  • Pantsman (unregistered) in reply to XXXXX

    He even stuffed up the if statement at the end: he ORs two identical strings.

  • Kohlrak (unregistered) in reply to Steve
    Steve:
    Hey! This code was developed in my company!

    And of course that is the best approach to do it. We had a close deadline, get it? In fact, one thing that I recommended to speed up the process is to stop using open source projects and start developing in-house libraries we could use.

    That way we managed to keep the development time down to Six Weeks!

    Oh, you're good.

  • Database Troll (unregistered) in reply to trwww
    trwww:
    The fastest most dependable way to clean html from a string/file is to use SAX and only forward the characters events.

    Provided your purpose is to generate a plain-text equivalent, not to prevent malicious users from injecting rogue HTML tags:

    &lt;script&gt;exploit.run()&lt;/script&gt;

    Run the above through SAX and you'll probably get:

    <script>exploit.run()</script>
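    This is easy to reproduce with Python's stdlib xml.sax (standing in for whatever SAX implementation trwww had in mind; the document and handler names below are illustrative): when the input carries an escaped script tag as character data, the characters events deliver the unescaped text, so forwarding them verbatim reintroduces live markup.

```python
import xml.sax

class TextOnly(xml.sax.ContentHandler):
    """Forward only the characters events, per trwww's suggestion."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)

# the input contains an *escaped* script tag as character data
doc = b"<root>&lt;script&gt;exploit.run()&lt;/script&gt;</root>"
handler = TextOnly()
xml.sax.parseString(doc, handler)
text = "".join(handler.chunks)
```

    The joined character data is the raw string <script>exploit.run()</script> again - fine for extracting plain text, useless as an anti-injection defense.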
  • An alternate explanation (unregistered)

    And in case anyone cares. The correct solution is to download jsoup and then do a

        Whitelist whiteList = new Whitelist();
        whiteList.addTags("br", "b", "em", "i", "strong", "u", "p", "a", "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6");
        whiteList.addProtocols("a", "href", "http", "https", "mailto", "tel");
        whiteList.addAttributes("a", "href");
        String result = Jsoup.clean(value, whiteList);
    
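    For comparison, the same allowlist idea can be sketched with Python's stdlib html.parser (the tag and attribute names are copied from the jsoup snippet above; everything else is a toy sketch, not a production sanitizer - in particular it does not re-escape text content):

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"br", "b", "em", "i", "strong", "u", "p", "a",
                "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6"}
ALLOWED_ATTRS = {"a": {"href"}}

class AllowlistCleaner(HTMLParser):
    """Emit only allowlisted tags/attributes; pass text through."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            kept = [(k, v) for k, v in attrs
                    if k in ALLOWED_ATTRS.get(tag, set())]
            attr_str = "".join(f' {k}="{v}"' for k, v in kept)
            self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # NOTE: a real sanitizer must re-escape this text
        self.out.append(data)

def clean(value: str) -> str:
    cleaner = AllowlistCleaner()
    cleaner.feed(value)
    cleaner.close()
    return "".join(cleaner.out)

result = clean('<p onclick="x()">hi <script>exploit.run()</script></p>')
```

    The disallowed onclick attribute and the script tags are dropped, while the allowed p element and its text survive.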
  • Seth (unregistered)

    Gotta love the double checking nature even at the most elementary level:

    (html.Contains(" >") || html.Contains(" >")
    

    I'm sure they thought of the possibility of the value of the HTML string changing on another thread

  • Peter (unregistered) in reply to Silverhill
    Silverhill:
    if (!(html.Contains(" >") || html.Contains(" >"))) break; html = html.Replace(" >", ">"); html = html.Replace("< ", "<");
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
    Um, no. Read the expression again, paying careful attention to where the parenthesis after "!" is closed.
  • Anon (unregistered) in reply to TDWTF Reader

    Think you meant hovel rather than hobble ;)

  • not frits at all (unregistered) in reply to Seth
    Seth:
    Gotta love the double checking nature even at the most elementary level:
    (html.Contains(" >") || html.Contains(" >")
    I'm sure they thought of the possibility of the value of the HTML string changing on another thread
    I'm sure this was a failed copy/paste and should have been:
    (html.Contains(" >") || html.Contains("< ")
    I mean, who hasn't done something like this ?
  • (cs) in reply to not frits at all
    not frits at all:
    Seth:
    Gotta love the double checking nature even at the most elementary level:
    (html.Contains(" >") || html.Contains(" >")
    I'm sure they thought of the possibility of the value of the HTML string changing on another thread
    I'm sure this was a failed copy/paste and should have been:
    (html.Contains(" >") || html.Contains("< ")
    I mean, who hasn't done something like this ?

    Ha, I see what you did there. Clever.

    Refactored to catch all scenarios:

    if (!(html.Contains(" >") || html.Contains(" >"))) 
    && if ((html.Contains(" >") == false || (html.Contains(" >") == true))
  • Anon (unregistered)

    I'm pretty sure they wouldn't have to do all this if they were using BobX..

  • Umm (unregistered)

    So why were half of yesterday's comments truncated?

  • trwtf (unregistered) in reply to Umm
    Umm:
    So why were half of yesterday's comments truncated?
    Off-topic comments are routinely deleted by the editors. The deleted comments had nothing to do with the article.
  • mr guy (unregistered)

    good old copy-paste: if (!(html.Contains(" >") || html.Contains(" >")))

  • (cs)

    Glad Alex took care of Mr. V1Agra. Hopefully this comment will post.

  • Mike D. (unregistered) in reply to Thats not WTF, its reality.
    Thats not WTF:
    Some programs can't even write valid HTML, so someone has to fix the errors of others. A good example is the mess that M$ Word creates when you save as HTML.
    Office doesn't even do a good job of saving in its supposedly native XML format.

    My Mom wrote something on Office 2009 or so, saved it as .docx (it was the default), and then took it to my uncle's place, where he had an old XP machine running WordPerfect. Needless to say, nothing would open the file, and my uncle didn't want me installing anything on that machine.

    So I opened the file in Notepad, saw that the first two letters were "PK", changed the file's extension to .zip, opened that, and found the actual document file (as opposed to the six other files that held settings and whatever else).

    Egads.

    Bear in mind that the original file just had default font, style, page, and other settings. It should have been pretty much just a text file. But no.

    Every single "misspelled" word (i.e. every proper name) was wrapped in a five-tag-deep stack: one tag to mark it as a misspelled word and the others to reapply Arial 13 Regular single-spaced left-justified or whatever. Every single "grammar error" got the same treatment. Bad grammar and spelling? Let's just say I got to see how far that rabbit hole went.

    Worse yet, WordPerfect didn't have a search-and-replace that supported wildcards, only exact matches. (Remember, not allowed to install software, so WP, Notepad, and Wordpad were all that was available.) Mercifully, it had an "extend selection to search term" checkbox. It still took 15 minutes just to strip off all the tags and figure out which one (I think it was "w:p") marked off paragraphs.

    It took Microsoft over 6000 pages to describe this file format as a standard. Ugh.

  • James (unregistered) in reply to np

    That looks like a bug, it should probably read:

    if (!(html.Contains(" >") || html.Contains(" <")))

  • John Smith (unregistered) in reply to trwww

    "The fastest most dependable way to clean html from a string/file is to use SAX and only forward the characters events."

    Until your SAX processor runs into a lone BR tag and bombs because it isn't correct XML.

  • gallier2 (unregistered)

    Two pages of comments and nobody has pointed out yet that TRWTF is using Regex on SGML derivatives. WTF.

    HTML and XML are not context-free and cannot be properly parsed by a regular expression.

    For example this is valid html:

    <tag attrib="abc>de">

    or this is valid XML

    <hello>>and now>>>>>></hello>

    http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
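    The first example is easy to verify. With Python's re module, the classic naive tag-stripping pattern (my choice of pattern, not anything from the article) trips over the '>' inside the quoted attribute:

```python
import re

snippet = '<tag attrib="abc>de">text</tag>'
# naive tag stripper: match '<', then anything that isn't '>', then '>'
stripped = re.sub(r"<[^>]*>", "", snippet)
# the first match ends at the '>' inside the attribute value,
# so half of the attribute leaks into the "cleaned" output
```

    Instead of the text alone, half an attribute value survives in the output.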

  • Regex blank stare. (unregistered)

    I stop and stare at the very first regex:

    [^>]*?

    This allows zero or more characters (that are not an end bracket), and makes it optional too just in case nothing isn't there either. I think.

    nulla: nienta, nada

  • tragomaskhalos (unregistered)

    Apart from the obvious OMGs, what's with this?

        html = html.Replace("<p></p>", "");
        html = html.Replace("<p> </p>", "");
        html = html.Replace("<div></div>", "");
        html = html.Replace("<div> </div>", "");

    All those headbanging regexes above and yet no \s* here?!
  • (cs) in reply to gallier2
    gallier2:
    Two pages of comments and nobody has pointed out yet that the RWTF is to use Regex on a SGML derivatives. WTF

    HTML and XML are not context free and can not be properly parsed by a regular expression.

    For example this is valid html:

    <tag attrib="abc>de">

    or this is valid XML

    <hello>>and now>>>>>></hello>

    http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

    That was actually my point as well as that of the guy who commented about SAX. But whatever.
  • (cs) in reply to Nagesh
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.

  • Anonymous (unregistered) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.
    5545... 554522... 5545225654

    Yeah! That's the tune to Funky Town!

  • (cs) in reply to Mike D.
    Mike D.:
    Office doesn't even do a good job of saving in its supposedly native XML format.

    My Mom wrote something on Office 2009 or so, saved it as .docx (it was the default), and then took it to my uncle's place, where he had an old XP machine running WordPerfect. Needless to say, nothing would open the file, and my uncle didn't want me installing anything on that machine.

    So I opened the file in Notepad, saw that the first two letters were "PK", changed the file's extension to .zip, opened that, and found the actual document file (as opposed to the six other files that held settings and whatever else).

    Egads.

    Bear in mind that the original file just had default font, style, page, and other settings. It should have been pretty much just a text file. But no.

    Every single "misspelled" word (i.e. every proper name) was wrapped in a five-tag-deep stack: one tag to mark it as a misspelled word and the others to reapply Arial 13 Regular single-spaced left-justified or whatever. Every single "grammar error" got the same treatment. Bad grammar and spelling? Let's just say I got to see how far that rabbit hole went.

    It took Microsoft over 6000 pages to describe this file format as a standard. Ugh.

    Unfortunately for Microsoft, it has to support all the legacy options and quirks of older versions of Word. Most of those 6000 pages describe those strange options.

  • (cs)

    I wrote an XHTML sanitizer for j0rb. Emphasis on the X (if that wasn't clear). I wasn't sure how slow it would be, but nonetheless, I load the document with a parser (requiring well-formed XML), and then walk every node in the tree. It supports two levels of "bad": remove and abort. Elements that are considered harmless, but undesirable, are replaced by the inner text. Attributes that are considered harmless, but undesirable, are removed entirely. Elements and attributes that are considered potentially malicious (<script>, onanything attributes, etc.) just log the occurrence, throw an exception, and refuse the data until the user fixes it. Much to my surprise, it actually seems to perform pretty decently. I'm not saying that it's the right way, but it's certainly a lot more right than this. :P
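    A sketch of that walk-the-tree approach using Python's stdlib ElementTree (the two severity levels come from the comment above; the tag sets, function name, and exception type are invented for illustration):

```python
import xml.etree.ElementTree as ET

REMOVE_TAGS = {"font"}             # harmless but undesirable: replaced by inner text
REMOVE_ATTRS = {"style", "class"}  # harmless but undesirable: removed entirely
ABORT_TAGS = {"script"}            # potentially malicious: refuse the data

class MaliciousMarkup(Exception):
    """Raised so the caller can log the occurrence and reject the input."""

def sanitize(elem):
    # strip undesirable attributes in place
    for attr in REMOVE_ATTRS & set(elem.attrib):
        del elem.attrib[attr]
    prev = None
    for child in list(elem):
        # abort on malicious elements or on* event-handler attributes
        if child.tag in ABORT_TAGS or any(a.startswith("on") for a in child.attrib):
            raise MaliciousMarkup(child.tag)
        sanitize(child)
        if child.tag in REMOVE_TAGS:
            # unwrap: splice the element's inner text into the parent
            text = "".join(child.itertext()) + (child.tail or "")
            if prev is None:
                elem.text = (elem.text or "") + text
            else:
                prev.tail = (prev.tail or "") + text
            elem.remove(child)
        else:
            prev = child
    return elem

cleaned = sanitize(ET.fromstring('<div style="x">a<font>b</font>c</div>'))
```

    Requiring well-formed XML up front, as the comment describes, is what makes this tree walk possible at all.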

  • (cs) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.

    I have not seen South Park. I thought you were name calling me.
  • (cs) in reply to gallier2
    gallier2:
    or this is valid XML

    <hello>>and now>>>>>></hello>

    Actually, that's not even well-formed XML, let alone valid. Well-formed XML requires the special-characters like '<' and '>' to be used only for defining syntax (i.e., directives, tags, CDATA sections, comments). They also have to match up perfectly, and that goes for nested tags too. Additionally, attributes must all be quoted properly; and all other elements must be contained in a single, root element.

    A valid XML document must be well-formed and also needs to contain a DOCTYPE declaration and must conform to it precisely.

    If you wanted the text ">and now>>>>>>" to appear in the <hello> element of your well-formed XML document then you would need to escape the '>' characters:

    <hello>&gt;and now&gt;&gt;&gt;&gt;&gt;&gt;</hello>

    Or put them in a CDATA section:

    <hello><![CDATA[>and now>>>>>>]]></hello>

    That said, you still can't reliably parse XML (nor any SGML dialect that I am aware of) with a regular expression. It's simply not possible.

  • (cs) in reply to Mike D.
    Mike D.:
    It took Microsoft over 6000 pages to describe a standard that it doesn't even use in its production software. Ugh.

    FTFY.

  • (cs) in reply to Anonymous
    Anonymous:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.
    5545... 554522... 5545225654

    Yeah! That's the tune to Funky Town!

    EEDE BA#A EGE

  • gallier2 (unregistered) in reply to xtremezone
    xtremezone:
    gallier2:
    or this is valid XML

    <hello>>and now>>>>>></hello>

    Actually, that's not even well-formed XML, let alone valid. Well-formed XML requires the special-characters like '<' and '>' to be used only for defining syntax (i.e., directives, tags, CDATA sections, comments). They also have to match up perfectly, and that goes for nested tags too. Additionally, attributes must all be quoted properly; and all other elements must be contained in a single, root element.

    Bzzzt, wrong. '<' is required to be replaced by a character entity; '>' is not. Don't believe me? Look for yourself:

    http://www.w3.org/TR/REC-xml/#dt-chardata

    2.4 Character Data and Markup

    Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

    [Definition: All text that is not markup constitutes the character data of the document.]

    The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

    In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

    To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

        Character Data
        [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
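    Whether that bare '>' is well-formed can be settled empirically. Python's stdlib ElementTree (built on expat, a conforming XML processor) accepts gallier2's document without complaint:

```python
import xml.etree.ElementTree as ET

# '>' is legal in character data; only '<' and '&' must be escaped
elem = ET.fromstring("<hello>>and now>>>>>></hello>")
text = elem.text
```

    expat would reject a literal '<' or '&' in content, but the '>' characters parse as ordinary character data, exactly as the quoted spec allows.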

  • ÃƆ(unregistered)
    Uh-oh, I think I found a problem with their cleaner-upper

    Signed, The Real Nagesh

  • (cs) in reply to xtremezone
    xtremezone:
    That said, you still can't reliably parse XML (nor any SGML dialect that I am aware of) with a regular expression. It's simply not possible.

    It's definitely not possible. Regular expressions can parse regular grammars. XML is not a regular grammar. It's not even context-free, as is often claimed. You can use the pumping lemma to prove both of these claims.

    You can use a "completion" of regular expression parsing to parse context-free grammars, but it is nasty. (To complete a category, you embed it in a larger category that contains all of the first category's limits) Basically, you have to use multi-pass parsing to handle arbitrary nesting by splitting on the "next" nested expression at the same level in the parse tree. It will be slow. If you're going to do this, you might as well be using recursion and/or a good parser generator. All the hard stuff will be hidden away in the parser generator's bowels or the one "tricky" recursive definition which captures the normal form for the computation. Recursion is easy anyway.
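    To make the pumping-lemma point concrete: recognizing arbitrarily deep balanced tags needs a counter or stack, which a regular expression does not have. A minimal Python sketch (a toy grammar with only <a> and </a>, invented for illustration; the depth counter is equivalent to a one-symbol stack):

```python
import re

TOKEN = re.compile(r"</?a>")

def balanced(s: str) -> bool:
    """Accept exactly the strings of properly nested <a>...</a> tags.
    The depth counter is what no single regular expression can simulate
    for unbounded nesting depth."""
    depth = 0
    pos = 0
    for m in TOKEN.finditer(s):
        if m.start() != pos:
            return False  # stray characters between tags
        pos = m.end()
        depth += 1 if m.group() == "<a>" else -1
        if depth < 0:
            return False  # closing tag with nothing open
    return depth == 0 and pos == len(s)
```

    The recursion (or here, the counter) is the one "tricky" piece; everything else is plain tokenizing, which regular expressions handle fine.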

  • Devil's Advocate (unregistered) in reply to Blowfish
    Blowfish:
    <snip>
    1. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    <snip>

    No. The result is potentially the same; however, because test() is a method, not a variable, they are not the same thing. Consider:

    static int myTestVar = 0;
    
    public bool test()
    {
      return (myTestVar++%2 == 0);
    }
    

    A more complicated example might be less predictable. What if "test()" simply runs the next test case in a file?

    Although the two test() calls appear funny, there are plausible explanations for what is going on (that's not to say it's necessarily good practice).

  • Devil's Advocate (unregistered) in reply to Devil's Advocate
    Devil's Advocate:
    Blowfish:
    <snip>
    1. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    <snip>

    No. The result is potentially the same, however because test() is a method, not a variable they are not the same thing. Consider:

    static int myTestVar = 0;
    
    public bool test()
    {
      return (myTestVar++%2 == 0);
    }
    

    A more complicated example might be less predictable. What if "test()" simply runs the next test case in a file?

    Although the two test() calls appear funny, there are plausible explanations for what is going on (that's not to say it's necessarily good practice).

    OOPS - let me retract that, I read your code rather than the code in the article.....My Bad!!

  • Jason (unregistered) in reply to trwtf
    trwtf:
    Umm:
    So why were half of yesterdays comments truncated?
    Off-topic comments are routinely deleted by the editors. The deleted comments had nothing to do with the article.

    Plus there seemed to be a thread starting about V1agra

  • (cs) in reply to Blowfish
    Blowfish:
    All the nice RegEx's in the beginning... (btw, one running one of them twice only removes two attributes from the tags, so if there are more...)

    I thought of that, too. So I did this:

       

    That got it into production the first time...

Leave a comment on “Squeaky-Clean HTML”
