• Jim (unregistered) in reply to np
    np:
    if (!(html.Contains(" >") || html.Contains(" >")))

    I like it when code looks like 1 || 1. Maybe they thought the first html.Contains(" >") wasn't enough to find those pesky " >".

    They had to pass over twice to remove attributes, so maybe they feel they gotta do this twice too?

  • le stuff (unregistered) in reply to Power Troll
    Power Troll:
    Wasn't perl created for doing stuff like this?
    "for doing stuff" seems unnecessary...
  • (cs)

    Most vehicular wheels aren't perfectly circular... they're a little bit flat on the bottom.

  • Thats not WTF, its reality. (unregistered)

    Some programs can't even write valid HTML, so someone has to fix the errors of others. A good example is the mess that M$ Word creates when you save as HTML.

    Horrible stuff, indeed.

  • (cs)

    I like how they'll remove style="stuff" but not style = "stuff" - despite all of their attempts at stripping multiple trailing spaces from p and div tags.

  • buzzomatic (unregistered)

    Here's part of my solution for "cleaning" HTML: https://github.com/rowan-lewis/wiki/blob/master/wiki/libs/html.php

    So yeah, it uses XPath, Lib Tidy, regular expressions and then XSLT to output only the desired markup.

  • Shirley (unregistered)

    Ouch. Just ouch.

  • F (unregistered) in reply to brunascle
    brunascle:
    html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);

    The hell? That won't clean HTML, it will break it.

    Especially with IgnoreCase set ... after all, those uppercase blanks are there for a purpose.

  • anobin (unregistered) in reply to XXXXX

    It's sad when someone makes a joke about doing something and then you remember that you did do that once in the past.

  • anon (unregistered)

    The center cannot hold!

  • (cs)
    if (!(html.Contains(" >") || html.Contains(" >"))) break; html = html.Replace(" >", ">"); html = html.Replace("< ", "<");
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
  • P (unregistered) in reply to XXXXX
    XXXXX:
    They didn't write their own reg-ex language. Much less their own reg-ex processor. At this rate they'll never develop their own in-house, proprietary mark-up language. Color me unimpressed.

    Yes, if they'd done that they could have completed the whole project in six weeks...

  • Blowfish (unregistered)

    All the nice RegExes in the beginning... (btw, running one of them twice only removes two attributes from the tags, so if there are more...), and then:

        html = Regex.Replace(html, @">\s+<", "", RegexOptions.IgnoreCase);
        html = html.Replace("<p></p>", "");
        html = html.Replace("<div></div>", "");
        // snip repeats
        html = html.Replace("\n", "");
        html = html.Replace("\r", "");
        html = html.Replace("<br>", "");
        html = html.Replace("<br />", "");
        // some more snipping
        html = html.Replace("<p>&nbsp;</p>", "");
        html = html.Replace("<div>&nbsp;</div>", "");
        html = html.Replace("<p> &nbsp; </p>", "");
        while (true)
        {
            if (!(html.Contains(" >") || html.Contains(" >"))) break;
            html = html.Replace(" >", ">");
            html = html.Replace("< ", "<");
        }
    1. The first one has the potential to completely break the html code.
    2. So inserting
          Regex.Replace(html, @"<(div|p)\s*>(\s+|&nbsp;)\s*</\1>", "", RegexOptions.IgnoreCase)
      was too difficult? (Well, I assume it was a different coder...)
    3. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    4. So, removing line breaks for obfuscation?

    CAPTCHA: conventio ... it is truly against convention what the coder(s) is/are trying to do here.
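    Incidentally, the loop itself is not as redundant as it looks: a single Replace pass shortens each run of spaces before '>' by only one character, so several passes are needed to reach a fixed point. A sketch of that fixed-point behavior in Python (str.replace, like .NET's String.Replace, substitutes all non-overlapping occurrences in one pass):

```python
def strip_space_before_gt(html: str) -> str:
    """Mirror the article's while loop: keep collapsing ' >' to '>'
    until no occurrence remains (a fixed point)."""
    while " >" in html:
        # one pass replaces every non-overlapping " >", but a run of
        # N spaces before '>' still needs N passes to disappear
        html = html.replace(" >", ">")
    return html

cleaned = strip_space_before_gt("<p   >text</p >")
```

    With three spaces before the '>', three passes are needed, which is presumably why the original author wrapped the Replace calls in a loop at all.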

  • Pantsman (unregistered) in reply to XXXXX

    He even stuffed up the if statement at the end: he ORs two identical strings.

  • Kohlrak (unregistered) in reply to Steve
    Steve:
    Hey! This code was developed in my company!

    And of course that is the best approach to do it. We had a close deadline, get it? In fact, one thing that I recommended to speed up the process is to stop using open source projects and start developing in-house libraries we could use.

    That way we managed to keep the development time down to Six Weeks!

    Oh, you're good.

  • Database Troll (unregistered) in reply to trwww
    trwww:
    The fastest most dependable way to clean html from a string/file is to use SAX and only forward the characters events.

    Provided your purpose is to generate a plain-text equivalent, not to prevent malicious users from injecting rogue HTML tags:

    &lt;script&gt;exploit.run()&lt;/script&gt;

    Run the above through SAX and you'll probably get:

    <script>exploit.run()</script>
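    This is easy to reproduce with Python's stdlib xml.sax (standing in for whatever SAX implementation trwww had in mind; the document and handler names below are illustrative): when the input carries an escaped script tag as character data, the characters events deliver the unescaped text, so forwarding them verbatim reintroduces live markup.

```python
import xml.sax

class TextOnly(xml.sax.ContentHandler):
    """Forward only the characters events, per trwww's suggestion."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)

# the input contains an *escaped* script tag as character data
doc = b"<root>&lt;script&gt;exploit.run()&lt;/script&gt;</root>"
handler = TextOnly()
xml.sax.parseString(doc, handler)
text = "".join(handler.chunks)
```

    The joined character data is the raw string <script>exploit.run()</script> again - fine for extracting plain text, useless as an anti-injection defense.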
  • An alternate explanation (unregistered)

    And in case anyone cares. The correct solution is to download jsoup and then do a

        Whitelist whiteList = new Whitelist();
        whiteList.addTags("br", "b", "em", "i", "strong", "u", "p", "a", "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6");
        whiteList.addProtocols("a", "href", "http", "https", "mailto", "tel");
        whiteList.addAttributes("a", "href");
        String result = Jsoup.clean(value, whiteList);
    
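    For comparison, the same allowlist idea can be sketched with Python's stdlib html.parser (the tag and attribute names are copied from the jsoup snippet above; everything else is a toy sketch, not a production sanitizer - in particular it does not re-escape text content):

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"br", "b", "em", "i", "strong", "u", "p", "a",
                "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6"}
ALLOWED_ATTRS = {"a": {"href"}}

class AllowlistCleaner(HTMLParser):
    """Emit only allowlisted tags/attributes; pass text through."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            kept = [(k, v) for k, v in attrs
                    if k in ALLOWED_ATTRS.get(tag, set())]
            attr_str = "".join(f' {k}="{v}"' for k, v in kept)
            self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # NOTE: a real sanitizer must re-escape this text
        self.out.append(data)

def clean(value: str) -> str:
    cleaner = AllowlistCleaner()
    cleaner.feed(value)
    cleaner.close()
    return "".join(cleaner.out)

result = clean('<p onclick="x()">hi <script>exploit.run()</script></p>')
```

    The disallowed onclick attribute and the script tags are dropped, while the allowed p element and its text survive.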
  • Seth (unregistered)

    Gotta love the double checking nature even at the most elementary level:

    (html.Contains(" >") || html.Contains(" >")
    

    I'm sure they thought of the possibility of the value of the HTML string changing on another thread

  • Peter (unregistered) in reply to Silverhill
    Silverhill:
    if (!(html.Contains(" >") || html.Contains(" >"))) break; html = html.Replace(" >", ">"); html = html.Replace("< ", "<");
    So, whether or not the code contains " >", do the replace operation? Just to make real sure, I guess.
    Um, no. Read the expression again, paying careful attention to where the parenthesis after "!" is closed.
  • Anon (unregistered) in reply to TDWTF Reader

    Think you meant hovel rather than hobble ;)

  • not frits at all (unregistered) in reply to Seth
    Seth:
    Gotta love the double checking nature even at the most elementary level:
    (html.Contains(" >") || html.Contains(" >")
    I'm sure they thought of the possibility of the value of the HTML string changing on another thread
    I'm sure this was a failed copy/paste and should have been:
    (html.Contains(" >") || html.Contains("< ")
    I mean, who hasn't done something like this ?
  • (cs) in reply to not frits at all
    not frits at all:
    Seth:
    Gotta love the double checking nature even at the most elementary level:
    (html.Contains(" >") || html.Contains(" >")
    I'm sure they thought of the possibility of the value of the HTML string changing on another thread
    I'm sure this was a failed copy/paste and should have been:
    (html.Contains(" >") || html.Contains("< ")
    I mean, who hasn't done something like this ?

    Ha, I see what you did there. Clever.

    Refactored to catch all scenarios:

    if (!(html.Contains(" >") || html.Contains(" >"))) 
    && if ((html.Contains(" >") == false || (html.Contains(" >") == true))
  • Anon (unregistered)

    I'm pretty sure they wouldn't have to do all this if they were using BobX..

  • Umm (unregistered)

    So why were half of yesterday's comments truncated?

  • trwtf (unregistered) in reply to Umm
    Umm:
    So why were half of yesterday's comments truncated?
    Off-topic comments are routinely deleted by the editors. The deleted comments had nothing to do with the article.
  • mr guy (unregistered)

    good old copy-paste: if (!(html.Contains(" >") || html.Contains(" >")))

  • (cs)

    Glad Alex took care of Mr. V1Agra. Hopefully this comment will post.

  • Mike D. (unregistered) in reply to Thats not WTF, its reality.
    Thats not WTF:
    Some programs can't even write valid HTML, so someone has to fix the errors of others. A good example is the mess that M$ Word creates when you save as HTML.
    Office doesn't even do a good job of saving in its supposedly native XML format.

    My Mom wrote something on Office 2009 or so, saved it as .docx (it was the default), and then took it to my uncle's place, where he had an old XP machine running WordPerfect. Needless to say, nothing would open the file, and my uncle didn't want me installing anything on that machine.

    So I opened the file in Notepad, saw that the first two letters were "PK", changed the file's extension to .zip, opened that, and found the actual document file (as opposed to the six other files that held settings and whatever else).

    Egads.

    Bear in mind that the original file just had default font, style, page, and other settings. It should have been pretty much just a text file. But no.

    Every single "misspelled" word (i.e. every proper name) was wrapped in a five-tag-deep stack: one tag to mark it as a misspelled word and the others to reapply Arial 13 Regular single-spaced left-justified or whatever. Every single "grammar error" got the same treatment. Bad grammar and spelling? Let's just say I got to see how far that rabbit hole went.

    Worse yet, WordPerfect didn't have a search-and-replace that supported wildcards, only exact matches. (Remember, not allowed to install software, so WP, Notepad, and Wordpad were all that was available.) Mercifully, it had an "extend selection to search term" checkbox. It still took 15 minutes just to strip off all the tags and figure out which one (I think it was "w:p") marked off paragraphs.

    It took Microsoft over 6000 pages to describe this file format as a standard. Ugh.

  • James (unregistered) in reply to np

    That looks like a bug, it should probably read:

    if (!(html.Contains(" >") || html.Contains(" <")))

  • John Smith (unregistered) in reply to trwww

    "The fastest most dependable way to clean html from a string/file is to use SAX and only forward the characters events."

    Until your SAX processor runs into a lone BR tag and bombs because it isn't correct XML.

  • gallier2 (unregistered)

    Two pages of comments and nobody has pointed out yet that TRWTF is using Regex on SGML derivatives. WTF.

    HTML and XML are not context-free and cannot be properly parsed by a regular expression.

    For example this is valid html:

    <tag attrib="abc>de">

    or this is valid XML

    <hello>>and now>>>>>></hello>

    http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
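    The first example is easy to verify. With Python's re module, the classic naive tag-stripping pattern (my choice of pattern, not anything from the article) trips over the '>' inside the quoted attribute:

```python
import re

snippet = '<tag attrib="abc>de">text</tag>'
# naive tag stripper: match '<', then anything that isn't '>', then '>'
stripped = re.sub(r"<[^>]*>", "", snippet)
# the first match ends at the '>' inside the attribute value,
# so half of the attribute leaks into the "cleaned" output
```

    Instead of the text alone, half an attribute value survives in the output.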

  • Regex blank stare. (unregistered)

    I stop and stare at the very first regex:

    [^>]*?

    This allows zero or more characters (that are not an end bracket), and makes it optional too just in case nothing isn't there either. I think.

    nulla: nienta, nada

  • tragomaskhalos (unregistered)

    Apart from the obvious OMGs, what's with this?

        html = html.Replace("<p></p>", "");
        html = html.Replace("<p> </p>", "");
        html = html.Replace("<div></div>", "");
        html = html.Replace("<div> </div>", "");

    All those headbanging regexes above and yet no \s* here?!
  • (cs) in reply to gallier2
    gallier2:
    Two pages of comments and nobody has pointed out yet that the RWTF is to use Regex on a SGML derivatives. WTF

    HTML and XML are not context free and can not be properly parsed by a regular expression.

    For example this is valid html:

    <tag attrib="abc>de">

    or this is valid XML

    <hello>>and now>>>>>></hello>

    http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

    That was actually my point as well as that of the guy who commented about SAX. But whatever.
  • (cs) in reply to Nagesh
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.

  • Anonymous (unregistered) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.
    5545... 554522... 5545225654

    Yeah! That's the tune to Funky Town!

  • (cs) in reply to Mike D.
    Mike D.:
    Office doesn't even do a good job of saving in its supposedly native XML format.

    My Mom wrote something on Office 2009 or so, saved it as .docx (it was the default), and then took it to my uncle's place, where he had an old XP machine running WordPerfect. Needless to say, nothing would open the file, and my uncle didn't want me installing anything on that machine.

    So I opened the file in Notepad, saw that the first two letters were "PK", changed the file's extension to .zip, opened that, and found the actual document file (as opposed to the six other files that held settings and whatever else).

    Egads.

    Bear in mind that the original file just had default font, style, page, and other settings. It should have been pretty much just a text file. But no.

    Every single "misspelled" word (i.e. every proper name) was wrapped in a five-tag-deep stack: one tag to mark it as a misspelled word and the others to reapply Arial 13 Regular single-spaced left-justified or whatever. Every single "grammar error" got the same treatment. Bad grammar and spelling? Let's just say I got to see how far that rabbit hole went.

    It took Microsoft over 6000 pages to describe this file format as a standard. Ugh.

    Unfortunately for Microsoft, it has to support all the legacy options and quirks of older versions of Word. Most of those 6000 pages describe those strange options.

  • (cs)

    I wrote an XHTML sanitizer for j0rb. Emphasis on the X (if that wasn't clear). I wasn't sure how slow it would be, but nonetheless, I load the document with a parser (requiring well-formed XML), and then walk every node in the tree. It supports two levels of "bad": remove and abort. Elements that are considered harmless, but undesirable, are replaced by the inner text. Attributes that are considered harmless, but undesirable, are removed entirely. Elements and attributes that are considered potentially malicious (<script>, onanything attributes, etc.) just log the occurrence, throw an exception, and refuse the data until the user fixes it. Much to my surprise, it actually seems to perform pretty decently. I'm not saying that it's the right way, but it's certainly a lot more right than this. :P
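    A sketch of that walk-the-tree approach using Python's stdlib ElementTree (the two severity levels come from the comment above; the tag sets, function name, and exception type are invented for illustration):

```python
import xml.etree.ElementTree as ET

REMOVE_TAGS = {"font"}             # harmless but undesirable: replaced by inner text
REMOVE_ATTRS = {"style", "class"}  # harmless but undesirable: removed entirely
ABORT_TAGS = {"script"}            # potentially malicious: refuse the data

class MaliciousMarkup(Exception):
    """Raised so the caller can log the occurrence and reject the input."""

def sanitize(elem):
    # strip undesirable attributes in place
    for attr in REMOVE_ATTRS & set(elem.attrib):
        del elem.attrib[attr]
    prev = None
    for child in list(elem):
        # abort on malicious elements or on* event-handler attributes
        if child.tag in ABORT_TAGS or any(a.startswith("on") for a in child.attrib):
            raise MaliciousMarkup(child.tag)
        sanitize(child)
        if child.tag in REMOVE_TAGS:
            # unwrap: splice the element's inner text into the parent
            text = "".join(child.itertext()) + (child.tail or "")
            if prev is None:
                elem.text = (elem.text or "") + text
            else:
                prev.tail = (prev.tail or "") + text
            elem.remove(child)
        else:
            prev = child
    return elem

cleaned = sanitize(ET.fromstring('<div style="x">a<font>b</font>c</div>'))
```

    Requiring well-formed XML up front, as the comment describes, is what makes this tree walk possible at all.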

  • (cs) in reply to hoodaticus
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.

    I have not seen South Park. I thought you were name calling me.
  • (cs) in reply to gallier2
    gallier2:
    or this is valid XML

    <hello>>and now>>>>>></hello>

    Actually, that's not even well-formed XML, let alone valid. Well-formed XML requires the special-characters like '<' and '>' to be used only for defining syntax (i.e., directives, tags, CDATA sections, comments). They also have to match up perfectly, and that goes for nested tags too. Additionally, attributes must all be quoted properly; and all other elements must be contained in a single, root element.

    A valid XML document must be well-formed and also needs to contain a DOCTYPE declaration and must conform to it precisely.

    If you wanted the text ">and now>>>>>>" to appear in the <hello> element of your well-formed XML document then you would need to escape the '>' characters:

    <hello>&gt;and now&gt;&gt;&gt;&gt;&gt;&gt;</hello>

    Or put them in a CDATA section:

    <hello><![CDATA[>and now>>>>>>]]></hello>

    That said, you still can't reliably parse XML (nor any SGML dialect that I am aware of) with a regular expression. It's simply not possible.

  • (cs) in reply to Mike D.
    Mike D.:
    It took Microsoft over 6000 pages to describe a standard that it doesn't even use in its production software. Ugh.

    FTFY.

  • (cs) in reply to Anonymous
    Anonymous:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    hoodaticus:
    Nagesh:
    Christopher:
    hoodaticus:
    TRWTF is that there is no object model for HTML documents. I think I will write one and call it XMLDOM.

    I'm assuming you are an idiot, in the most polite sense.

    http://www.w3schools.com/HTMLDOM/dom_intro.asp

    Or maybe you are completely aware of this, and were funny or sarcastic, and I missed it. That's how seriously I take you.

    Either way... I still think you are am idiot.

    He's mostly a troll.

    You're a towel!

    Now you're confusing Arabs with Indians. What's next?

    Just let me get high. Then I'll remember the difference between Arabs and Indians.

    Getting high is a means of escape. All of this is just instant gratification. Do not believe in instant gratification. It is bad for your soul.

    I was actually quoting or paraphrasing South Park the whole time I was talking to you.
    5545... 554522... 5545225654

    Yeah! That's the tune to Funky Town!

    EEDE BA#A EGE

  • gallier2 (unregistered) in reply to xtremezone
    xtremezone:
    gallier2:
    or this is valid XML

    <hello>>and now>>>>>></hello>

    Actually, that's not even well-formed XML, let alone valid. Well-formed XML requires the special-characters like '<' and '>' to be used only for defining syntax (i.e., directives, tags, CDATA sections, comments). They also have to match up perfectly, and that goes for nested tags too. Additionally, attributes must all be quoted properly; and all other elements must be contained in a single, root element.

    Bzzzt, wrong. '<' is required to be replaced by a character entity; '>' is not. Don't believe me? Look for yourself:

    http://www.w3.org/TR/REC-xml/#dt-chardata

    2.4 Character Data and Markup

    Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

    [Definition: All text that is not markup constitutes the character data of the document.]

    The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

    In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

    To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

        Character Data
        [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
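    Whether that bare '>' is well-formed can be settled empirically. Python's stdlib ElementTree (built on expat, a conforming XML processor) accepts gallier2's document without complaint:

```python
import xml.etree.ElementTree as ET

# '>' is legal in character data; only '<' and '&' must be escaped
elem = ET.fromstring("<hello>>and now>>>>>></hello>")
text = elem.text
```

    expat would reject a literal '<' or '&' in content, but the '>' characters parse as ordinary character data, exactly as the quoted spec allows.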

  • ÃƆ(unregistered)
    Uh-oh, I think I found a problem with their cleaner-upper

    Signed, The Real Nagesh

  • (cs) in reply to xtremezone
    xtremezone:
    That said, you still can't reliably parse XML (nor any SGML dialect that I am aware of) with a regular expression. It's simply not possible.

    It's definitely not possible. Regular expressions can parse regular grammars. XML is not a regular grammar. It's not even context-free, as is often claimed. You can use the pumping lemma to prove both of these claims.

    You can use a "completion" of regular expression parsing to parse context-free grammars, but it is nasty. (To complete a category, you embed it in a larger category that contains all of the first category's limits) Basically, you have to use multi-pass parsing to handle arbitrary nesting by splitting on the "next" nested expression at the same level in the parse tree. It will be slow. If you're going to do this, you might as well be using recursion and/or a good parser generator. All the hard stuff will be hidden away in the parser generator's bowels or the one "tricky" recursive definition which captures the normal form for the computation. Recursion is easy anyway.
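    To make the pumping-lemma point concrete: recognizing arbitrarily deep balanced tags needs a counter or stack, which a regular expression does not have. A minimal Python sketch (a toy grammar with only <a> and </a>, invented for illustration; the depth counter is equivalent to a one-symbol stack):

```python
import re

TOKEN = re.compile(r"</?a>")

def balanced(s: str) -> bool:
    """Accept exactly the strings of properly nested <a>...</a> tags.
    The depth counter is what no single regular expression can simulate
    for unbounded nesting depth."""
    depth = 0
    pos = 0
    for m in TOKEN.finditer(s):
        if m.start() != pos:
            return False  # stray characters between tags
        pos = m.end()
        depth += 1 if m.group() == "<a>" else -1
        if depth < 0:
            return False  # closing tag with nothing open
    return depth == 0 and pos == len(s)
```

    The recursion (or here, the counter) is the one "tricky" piece; everything else is plain tokenizing, which regular expressions handle fine.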

  • Devil's Advocate (unregistered) in reply to Blowfish
    Blowfish:
    <snip>
    1. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    <snip>

    No. The result is potentially the same; however, because test() is a method, not a variable, they are not the same thing. Consider:

    static int myTestVar = 0;
    
    public bool test()
    {
      return (myTestVar++%2 == 0);
    }
    

    A more complicated example might be less predictable. What if "test()" simply runs the next test case in a file?

    Although the two test() calls appear funny, there are plausible explanations for what is going on (that's not to say it's necessarily good practice).

  • Devil's Advocate (unregistered) in reply to Devil's Advocate
    Devil's Advocate:
    Blowfish:
    <snip>
    1. The if is only a minor wtf, seeing as instead of
          if (!test())
      s/he wrote
          if (!(test() || test()))
      ... which is exactly the same thing.
    <snip>

    No. The result is potentially the same, however because test() is a method, not a variable they are not the same thing. Consider:

    static int myTestVar = 0;
    
    public bool test()
    {
      return (myTestVar++%2 == 0);
    }
    

    A more complicated example might be less predictable. What if "test()" simply runs the next test case in a file?

    Although the two test() calls appear funny, there are plausible explanations for what is going on (that's not to say it's necessarily good practice).

    OOPS - let me retract that, I read your code rather than the code in the article.....My Bad!!

  • Jason (unregistered) in reply to trwtf
    trwtf:
    Umm:
    So why were half of yesterdays comments truncated?
    Off-topic comments are routinely deleted by the editors. The deleted comments had nothing to do with the article.

    Plus there seemed to be a thread starting about V1agra

  • (cs) in reply to Blowfish
    Blowfish:
    All the nice RegEx's in the beginning... (btw, one running one of them twice only removes two attributes from the tags, so if there are more...)

    I thought of that, too. So I did this:

       

    That got it into production the first time...

Leave a comment on “Squeaky-Clean HTML”
