• Rhywden (unregistered)

    Okay, so they're able to use Regex but at the same time not.

    And since when are tags like "" and attributes like "class" subject to removal? <font> and deprecated stuff I can understand...

  • blah (unregistered)

    Are you sure this 'CMS' isn't Community Server?

  • XXXXX (unregistered)

    They didn't write their own reg-ex language. Much less their own reg-ex processor. At this rate they'll never develop their own in-house, proprietary mark-up language. Color me unimpressed.

  • Anonymous (unregistered)

    I believe in this case it stands for "Clearly Must Stop"

  • boog (unregistered)

    I'm pretty sure I would have strangled the writer of this before I hit the "span".

  • eddoe (unregistered)

    My intro CS class totally taught this way of HTML clean-up.

  • (cs)

    The first thing I noticed is that <script> is totally fine to have with this cleaner.

  • (cs)
    html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);

    The hell? That won't clean HTML, it will break it.
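
A rough sketch of why that line damages markup, taking it exactly as quoted above (empty replacement string). It is written in Java rather than C# purely for illustration; the class and method names are made up, and `collapseKeepingBrackets` is a hypothetical alternative, not something from the article.

```java
import java.util.regex.Pattern;

final class TagGapDemo {
    // Same pattern as the quoted line: '>' followed by whitespace followed by '<'.
    private static final Pattern GAP = Pattern.compile(">\\s+<");

    // Mirrors the quoted call: the whole match, brackets included, is deleted.
    static String collapseAsPosted(String html) {
        return GAP.matcher(html).replaceAll("");
    }

    // Hypothetical alternative: drop only the whitespace and keep the brackets.
    static String collapseKeepingBrackets(String html) {
        return GAP.matcher(html).replaceAll("><");
    }

    public static void main(String[] args) {
        String sample = "<b>Hello</b> <i>world</i>";
        System.out.println(collapseAsPosted(sample));        // <b>Hello</bi>world</i>
        System.out.println(collapseKeepingBrackets(sample)); // <b>Hello</b><i>world</i>
    }
}
```

Even the second variant changes how the page renders, since the space between two inline elements is significant.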

  • Anonymous (unregistered)

    I like the way it removes all the tag attributes first, thereby making the job of removing the actual tag a whole lot harder. But that's no problem, we'll just do a search and replace on "<DIV>" and "<DIV >" and "<DIV  >" and "<DIV   >" and "<DIV    >" and that'll do because it's not like the DIV tag supports any more than four attributes. Wait, what...?

  • (cs)

    The fastest, most dependable way to clean HTML from a string/file is to use SAX and only forward the characters events.
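
A minimal sketch of that idea, assuming the input is well-formed XML (XHTML); real-world HTML often is not, and the JDK parser will then throw. The class name `SaxTextExtractor` and the method `textOnly` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextExtractor {
    // Parses the markup and keeps only what arrives through characters() events;
    // start/end element events are ignored, so tags and attributes disappear.
    static String textOnly(String xhtml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: callers want an unchecked failure for malformed input.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOnly("<p>Hello <b>world</b>!</p>"));
    }
}
```

Entity references such as `&amp;` are resolved by the parser before they reach `characters()`, so the result is plain text rather than escaped markup.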

  • (cs)

    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

  • (cs)

    Clearly they've missed a few:

    html = html.Replace("<strong><strong>", "<strong>");
    html = html.Replace("</strong></strong>", "</strong>");

    html = html.Replace("<strong><strong><strong>", "<strong>");
    html = html.Replace("</strong></strong></strong>", "</strong>");

    html = html.Replace("<strong><strong><strong><strong>", "<strong>");
    html = html.Replace("</strong></strong></strong></strong>", "</strong>");

    html = html.Replace("<strong><strong><strong><strong><strong>", "<strong>");
    html = html.Replace("</strong></strong></strong></strong></strong>", "</strong>");
    
    
  • Nagesh Kukunoor (unregistered) in reply to Zylon
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    The Daily WTF- Curious Perversions of Written English

  • Anon (unregistered)

    I think I can clean the HTML better than this...

    public string CleanHtml(string html)
    {
        return "";
    }
  • Skilldrick (unregistered) in reply to Zylon
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

  • (cs)

    Wasn't perl created for doing stuff like this?

  • Jason Hooper (unregistered)

    Slowly put the Regex class on the ground and step away with your hands above your head.

  • np (unregistered)

    if (!(html.Contains(" >") || html.Contains(" >")))

    I like it when code looks like 1 || 1. Maybe they thought the first html.Contains(" >") wasn't enough to find those pesky " >".

  • anon (unregistered)

    I apologize... I, I just couldn't help myself.

    if (!(html.Contains(comment) || html.COntains(comment))) out << "I'm Frist!";

  • Rajendra Kumar (unregistered) in reply to Nagesh Kukunoor
    Nagesh Kukunoor:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    The Daily WTF- Curious Interpretations of Written English

    FTFY

  • (cs)

    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

  • (cs) in reply to Skilldrick
    Skilldrick:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

    It's only weird if you never had Calculus III. Or Algebra II. Or Geometry.

  • XXXXX (unregistered) in reply to hoodaticus
    hoodaticus:
    Skilldrick:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

    It's only weird if you never had Calculus III. Or Algebra II. Or Geometry.

    So... conic sections like hyperbolas, parabolas, and ellipses aren't rounded? I'm like 80% sure they aren't circular.

  • Christopher (unregistered) in reply to hoodaticus
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

  • (cs) in reply to Christopher
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    Stick to your day job.

  • (cs) in reply to Anon
    Anon:
    I think I can clean the HTML better than this...
    public string CleanHtml(string html)
    {
        return "";
    }

    You have got rid of everything! The correct way is to apply an XSL-based template to get rid of the HTML tags.
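
One way the XSL suggestion could look, sketched in Java with the JDK's built-in transformer. It assumes well-formed XHTML input and relies on XSLT's built-in template rules, which copy text nodes through when a stylesheet defines no templates of its own. `XslTextOnly` and `stripTags` are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XslTextOnly {
    // A stylesheet with no templates: the built-in rules walk every element
    // and emit only text nodes; method="text" suppresses markup in the output.
    private static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" "
        + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "</xsl:stylesheet>";

    static String stripTags(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Assumption: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b>!</p>"));
    }
}
```

Like the SAX approach, this only works when the markup parses as XML, which pasted Word HTML frequently does not.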

  • Gary (unregistered)

    I totally get this code. A few weirdnesses but a marginal WTF.

    I would have thought that behind this all, there was a "paste from Word into rich text editor" problem, but there's no smart-quote fixing.... Is it catching those notorious unquoted attributes that Word likes to throw into html?

    I used this in conjunction with ckeditor once:

    if (elm.firstChild.innerHTML.indexOf("Word.Document") > 0) {
        elm.removeChild(elm.firstChild);
    }
    

    because Word pops a big ugly stylesheet into the first

    element, and then I had to go and substitute html entities for all the Windows-encoded crap characters. Ugh.

  • (cs) in reply to Christopher
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    He's mostly a troll.

  • Adriano (unregistered) in reply to Skilldrick
    Skilldrick:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

    In fact, it's not. It's pointing out that, while most of the time things were perfect, sometimes they were beyond the pale. It's using round as a synonym of circular, to avoid repetition.

  • Weird Man (unregistered)

    I was expecting something along the lines of,

    public String cleanHTML(String html){
               return html.trim();
    }
    
  • (cs) in reply to XXXXX
    XXXXX:
    hoodaticus:
    Skilldrick:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

    It's only weird if you never had Calculus III. Or Algebra II. Or Geometry.

    So... conic sections like hyperbolas, parabolas, and ellipses aren't rounded? I'm like 80% sure they aren't circular.

    How about those ladies in Rubens paintings? wink wink

  • TDWTF Reader (unregistered) in reply to Power Troll
    Power Troll:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    Stick to your day job.

    Mmmmkay. Such inspiring advice! I must assume you are in a profession that permits such life-guiding instruction. Your five-word sentence clearly dictates that. Are you a psychiatrist or a psychologist? No wait. You must be an underpaid public school guidance counselor. No, I got it, you are a janitor who left their six-figure Wall Street career because you thought the streets were dirty. Either that, or you might just be some piece of nearly congealed flesh that managed to string together a few words that have no meaning or correlation to anything I said.

    Whatever you do, please keep your day job as well, because I wouldn't want to meet you in public after you climbed out of your cardboard box or dumpster or whatever hovel you live in.

  • (cs) in reply to Nagesh
    Nagesh:
    He's mostly a troll.
    Pot, meet kettle.
  • JAVA Pro (unregistered)
    public string cleanHTML(string html){
           return html.trim();
    }
    public string cleanerHTML(string html){
           return cleanHTML(html).trim();
    }
    
  • (cs) in reply to TDWTF Reader
    TDWTF Reader:
    Power Troll:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    Stick to your day job.

    Mmmmkay. Such inspiring advice! I must assume you are in a profession that permits such life-guiding instruction. Your five-word sentence clearly dictates that. Are you a psychiatrist or a psychologist? No wait. You must be an underpaid public school guidance counselor. No, I got it, you are a janitor who left their six-figure Wall Street career because you thought the streets were dirty. Either that, or you might just be some piece of nearly congealed flesh that managed to string together a few words that have no meaning or correlation to anything I said.

    Whatever you do, please keep your day job as well, because I wouldn't want to meet you in public after you climbed out of your cardboard box or dumpster or whatever hovel you live in.

    Cool story.

  • (cs) in reply to boog
    boog (cheap imitation):
    I'm pretty sure I would have strangled the writer of this before I hit the "span".
    I'm pretty sure all you're capable of doing is reinventing the wheel, only not perfectly circular and often times far from it.
  • Bob (unregistered) in reply to Gary
    Gary:
    I would have thought that behind this all, there was a "paste from Word into rich text editor" problem, but there's no smart-quote fixing.... Is it catching those notorious unquoted attributes that Word likes to throw into html?
    That looks like it. Here's the original version of the code: http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx
  • (cs) in reply to XXXXX
    XXXXX:
    hoodaticus:
    Skilldrick:
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    There's nothing wrong with that sentence. It's like saying "She wasn't beautiful, and often she was worse than ugly". It's saying they're not circular, and sometimes they're not even round. What's weird about that?

    It's only weird if you never had Calculus III. Or Algebra II. Or Geometry.

    So... conic sections like hyperbolas, parabolas, and ellipses aren't rounded? I'm like 80% sure they aren't circular.

    Conversing with intelligent life on the internet is a decadence these days. Thank you.

  • (cs) in reply to TDWTF Reader
    TDWTF Reader:
    Power Troll:
    Christopher:
    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.
    Stick to your day job.
    Mmmmkay. Such inspiring advice! I must assume blah blah bitch bitch blah... ...you might just be some piece of nearly congealed flesh that managed to string together a few words that have no meaning or correlation to anything I said.
    And why should they relate to anything you said; he wasn't talking to you.

    Or do you no longer care about maintaining anonymity?

  • (cs) in reply to Nagesh
    Nagesh:
    Anon:
    I think I can clean the HTML better than this...
    public string CleanHtml(string html)
    {
        return "";
    }

    You have got rid of everything! The correct way is to apply an XSL-based template to get rid of the HTML tags.

    That is the first thing I remember you saying that was actually smart. Good job.

  • (cs) in reply to Nagesh
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    He's mostly a troll.

    You're a towel!

  • (cs) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    Anon:
    I think I can clean the HTML better than this...
    public string CleanHtml(string html)
    {
        return "";
    }

    You have got rid of everything! The correct way is to apply an XSL-based template to get rid of the HTML tags.

    That is the first thing I remember you saying that was actually smart. Good job.

    This is the first time you realized that I am a smart guy. Keep up the good work!

  • (cs) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

  • Childish (unregistered) in reply to trwww
    trwww:
    The fastest, most dependable way to clean HTML from a string/file is to use SAX and only forward the characters events.

    ...and print the SAX parse exceptions traceback!

  • (cs) in reply to Nagesh
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

  • Steve (unregistered)

    Hey! This code was developed in my company!

    And of course that is the best approach to do it. We had a close deadline, get it? In fact, one thing that I recommended to speed up the process is to stop using open source projects and start developing in-house libraries we could use.

    That way we managed to keep the development time down to Six Weeks!

  • (cs) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it.

    Either way... I still think you are an idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

  • Taco (unregistered)

    I think TRWTF is that while/if/break structure towards the end.

  • (cs) in reply to Zylon
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    An imperfect circle could still be considered round.

  • Meh (unregistered) in reply to Zylon
    Zylon:
    "Naturally, their wheels aren't perfectly circular... and often times, they won't even fit into the most liberal definition of "round"."

    Woah... they're not circular AND not round? What are the odds?

    "not perfectly circular.....won't even fit most liberal definition of round"

    I don't think there's a problem....
