  • (cs) in reply to Peter
    Peter:
    Silverhill:
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
    Um, no. Read the expression again, paying careful attention to where the parenthesis after "!" is closed.
    Oops! Proper parenthesis parsing prevents potential pitfalls.... Thanx for the catch, Peter.
  • Peter (unregistered) in reply to Silverhill
    Silverhill:
    Peter:
    Silverhill:
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
    Um, no. Read the expression again, paying careful attention to where the parenthesis after "!" is closed.
    Oops! Proper parenthesis parsing prevents potential pitfalls.... Thanx for the catch, Peter.
    Glad to be of service ;)
  • (cs) in reply to frits
    frits:
    Clearly they've missed a few:
    html = html.Replace(" "");
    html = html.Replace("<\strong><\strong> "<\strong>");
    

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong> "<\strong>");

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong><\strong> "<\strong>");

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong><\strong><\strong> "<\strong>");

    Given the nature of this site, don't you think it would have been a better idea to use forward slashes instead of backslashes? They use the slashes consistently correctly; you use them consistently incorrectly.

    You develop in VB, too?

  • Anonymous (unregistered) in reply to frits
    frits:
    Anonymous:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are an idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.
    5545... 554522... 5545225654

    Yeah! That's the tune to Funky Town!

    EEDE BA#A EGE

    Much to my annoyance, it's not actually possible to play the tune to Funky Town on a touch-tone phone. I've been trying for a good 20 minutes now and all I've managed to do is accidentally order a pizza.
  • (cs) in reply to Severity One
    Severity One:
    frits:
    Clearly they've missed a few:
    html = html.Replace(" "");
    html = html.Replace("<\strong><\strong> "<\strong>");
    

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong> "<\strong>");

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong><\strong> "<\strong>");

    html = html.Replace(" ""); html = html.Replace("<\strong><\strong><\strong><\strong><\strong> "<\strong>");

    Given the nature of this site, don't you think it would have been a better idea to use forward slashes instead of backslashes? They use the slashes consistently correctly; you use them consistently incorrectly.

    You develop in VB, too?

    Mostly C derivatives, thanks for asking. It was a copy-pasta mistake, so sue me.

  • (cs) in reply to gallier2
    gallier2:
    Bzzzt, wrong. < is required to be replaced by a character entity, > not.
    I ... I didn't know. :'(

    Thanks for pointing that out. :)

  • gallier2 (unregistered) in reply to xtremezone
    xtremezone:
    gallier2:
    Bzzzt, wrong. < is required to be replaced by a character entity, > not.
    I ... I didn't know. :'(

    Thanks for pointing that out. :)

    If I hadn't received such a document last Wednesday containing untransformed >, I would not have known either. It so happened that I had just checked the W3C site that day.
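The rule gallier2 describes can be sketched in a few lines. This is a minimal illustration in Python rather than the C# of the article, and `escape_text` is a hypothetical helper name, not something from the original code. It assumes plain text content (not attribute values), where `&` and `<` need entities and `>` can stay literal.

```python
def escape_text(text: str) -> str:
    """Escape only the characters that must be escaped in text content."""
    # Replace "&" first so the entities produced for "<" are not re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;")


if __name__ == "__main__":
    # ">" is deliberately left untouched.
    print(escape_text("a < b && c > d"))
```

Attribute values would also need their quote characters escaped, which this sketch does not attempt.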

  • xmILL (unregistered)

    while(true) way to go!

    well but cleaning html is really a bitch unless you do document.firstChild.innerText()
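Outside a browser, roughly the same "just give me the text" idea can be approximated with a parser instead of string replacement. The sketch below uses Python's standard `html.parser` module; `TextExtractor` and `inner_text` are hypothetical names chosen for illustration, and the result is only a rough stand-in for what a browser's `innerText` returns (for example, script and style contents are not filtered out).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def inner_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(inner_text("<p>Hello <strong>world</strong>!</p>"))
```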

  • Luiz Felipe (unregistered) in reply to Mike D.

    Perhaps it is used to describe some undo operation.

  • (cs) in reply to Luiz Felipe
    Luiz Felipe:
    Perhaps it is used to describe some undo operation.
    I think it may be a side-effect of that, actually. MS almost certainly uses the command pattern in Office. Maybe that mindset leaked into the serialization format.
  • valetudo (unregistered) in reply to nobulate
    nobulate:
    not frits at all:
    Seth:
    Gotta love the double checking nature even at the most elementary level:
    (html.Contains(" >") || html.Contains(" >")
    I'm sure they thought of the possibility of the value of the HTML string changing on another thread
    I'm sure this was a failed copy/paste and should have been:
    (html.Contains(" >") || html.Contains("< ")
    I mean, who hasn't done something like this ?

    Ha, I see what you did there. Clever.

    Refactored to catch all scenarios:

    if (!(html.Contains(" >") || html.Contains(" >"))) 
    && if ((html.Contains(" >") == false) || (html.Contains(" >") == true) || (html.Contains(" >") == FILE_NOT_FOUND))
    You forgot one. FTFY.
  • DYo (unregistered)

    Hmm. Let me guess that we found this gem of code when there were either complaints about performance on the servers or it hit an infinite loop.

    "\s*" or set the ignore whitespace flag ftw... but there are so many other problems. With everything. Programming isn't like doodling notes in your notebook.
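The `\s` suggestion can be sketched as a single substitution in place of a loop of `Contains`/`Replace` calls. This is Python rather than the article's C#, `tighten_tag_ends` is a hypothetical name, and the assumed intent (removing whitespace in front of `>`) is inferred only from the `html.Contains(" >")` fragments quoted in this thread.

```python
import re


def tighten_tag_ends(html: str) -> str:
    """Remove any run of whitespace that directly precedes ">"."""
    # One pass handles spaces, tabs and newlines, however many there are.
    # Caveat: this also rewrites a literal " >" that appears in text content,
    # which is one reason regex-based HTML cleaning is fragile.
    return re.sub(r"\s+>", ">", html)


if __name__ == "__main__":
    print(tighten_tag_ends("<p   >hi</p >"))
```

Because the pattern matches the whole run of whitespace at once, there is no loop condition that could spin forever.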

  • Layla (unregistered)

    So basically there are 5 tags they want to remove and 5 attribute types. And they don't want white space created by div or P tags. Wow, these programmers are soo stupid. They left out thousands of combinations. There could easily be another 10,000+ lines of code for this.

  • DYo (unregistered) in reply to le stuff
    le stuff:
    Power Troll:
    Wasn't perl created for doing stuff like this?
    "for doing stuff" seems unnecessary...
    so you think: "Wasn't perl created like this?"

    lol

  • glompix (unregistered)

    The center cannot hold!

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Leave a comment on “Squeaky-Clean HTML”
