Squeaky-Clean HTML

Sami's company developed Yet Another Proprietary CMS and, like many home-grown systems, they've had to reinvent many wheels.

Naturally, their wheels aren't perfectly circular... and oftentimes, they won't even fit the most liberal definition of "round". But to their credit, the developers do try to be as thorough as possible. And they take cleaning HTML very seriously.

public string CleanHtml(string html)
{
    // start by completely removing all unwanted tags 
    html = Regex.Replace(html, 
       @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", 
       "", 
       RegexOptions.IgnoreCase);

    // then run another pass over the html (twice), removing unwanted attributes 
    html = Regex.Replace(html,
       @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", 
       "<$1$2>", 
       RegexOptions.IgnoreCase);
    html = Regex.Replace(html,
       @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", 
       "<$1$2>", 
       RegexOptions.IgnoreCase);

    html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);

    html = html.Replace("<p></p>", "");
    html = html.Replace("<p ></p>", "");
    html = html.Replace("<p  ></p>", "");
    html = html.Replace("<p   ></p>", "");
    html = html.Replace("<p    ></p>", "");
    html = html.Replace("<P></P>", "");
    html = html.Replace("<P ></P>", "");
    html = html.Replace("<P  ></P>", "");
    html = html.Replace("<P   ></P>", "");
    html = html.Replace("<P    ></P>", "");
    html = html.Replace("\n", "");
    html = html.Replace("\r", "");
    html = html.Replace("<DIV></DIV>", "");
    html = html.Replace("<DIV ></DIV>", "");
    html = html.Replace("<DIV  ></DIV>", "");
    html = html.Replace("<DIV   ></DIV>", "");
    html = html.Replace("<DIV    ></DIV>", "");
    html = html.Replace("<DIV>&nbsp;</DIV>", "");
    html = html.Replace("<P>&nbsp;</P>", "");
    html = html.Replace("<P>&nbsp; </P>", "");

    while (true)
    {
        if (!(html.Contains(" >") || html.Contains(" >")))
            break;
        html = html.Replace(" >", ">");
        html = html.Replace("< ", "<");
    }

    return html;
}
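
For contrast, here is a minimal sketch of how the wall of `Replace` calls for empty paragraphs and divs might collapse into one pattern. The class and method names (`EmptyBlockCleaner`, `RemoveEmptyBlocks`) are made up for illustration and are not part of Sami's codebase. The pattern is also deliberately broader than the original: it accepts any amount of whitespace, any letter case, and any run of `&nbsp;` between the tags, so it is not a behavior-for-behavior replacement. Regexes remain a shaky tool for HTML in general.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original CMS.
public static class EmptyBlockCleaner
{
    // An opening <p> or <div> with optional whitespace before '>',
    // then only whitespace or &nbsp; entities, then the matching close tag.
    // With IgnoreCase, the \1 backreference also matches case-insensitively.
    private static readonly Regex EmptyBlock = new Regex(
        @"<(p|div)\s*>(?:\s|&nbsp;)*</\1\s*>",
        RegexOptions.IgnoreCase);

    public static string RemoveEmptyBlocks(string html)
    {
        // Repeat until nothing changes, so that removing an inner empty
        // element can expose an outer one (e.g. <div><p></p></div>).
        string previous;
        do
        {
            previous = html;
            html = EmptyBlock.Replace(html, "");
        } while (html != previous);

        return html;
    }
}
```

Calling `EmptyBlockCleaner.RemoveEmptyBlocks("<p>Hi</p><P  ></P><div>&nbsp;</div>")` under these assumptions leaves only `<p>Hi</p>`.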
