• Tom (unregistered)

    I tried extracting text from HTML with regular expressions once.

    Once.

    For about five minutes.

  • (cs) in reply to chubertdev

    Man, these problems are like a wet dream to me. I mean, they have it all: poorly formatted XML and RegEx. When I told my parents, with a smile on my face, that I would take CS in college, this was exactly what I was thinking about... and PHP, which for some reason didn't feel like a good idea later on.

  • nobody (unregistered)

    lynx -dump http://www.w3.org/

  • (cs)

    Well, my first thought on this is: He really, Really, REALLY likes lots of objects; a whole flood of objects; in fact, a deluge of objects; and MONSTROUS objects, too! In fact, he is the King Kong of garbage collection testers.

    My second thought is: He really can't keep a thought in mind from one part of a program to another, can he? Because otherwise, he wouldn't have put in all that object-creating code to regularize the line breaks and tabs he removed at the top.

    My third thought is: He really doesn't like ampersands, does he? After all, he just killed them all...including "&".

  • Anogynous (unregistered) in reply to chubertdev

    The sites given make this far too easy.

    public String filterHTML (String inputPage) {
        return null;
    }
    
  • (cs) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    When has there been important content on any of those sites?

  • Calli Arcale (unregistered) in reply to PedanticCurmudgeon
    PedanticCurmudgeon:
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    When has there been important content on any of those sites?

    I had a Geocities page in 2000. Now I has a sad.

    (Okay, I admit, it was an MST3K fanfic site.)

  • Garrison Fiord (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    Garrison Fiord:
    D-Coder:
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.
    Okay, now we know you're doing it on purpose.
    Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

    I also haven't registered my name so I can't edit my posts and other strangers can just use it and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself

    I probably wouldn't want the jerks who stoop to modify my comments to have my email address, thank you.

  • Joe (unregistered) in reply to chubertdev
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    This is clearly a trick. The giveaway is criterion #2, because we all know that there is no important content on any Geocities site from 2000.

  • (cs) in reply to Tom Mathias
    Tom Mathias:
    Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

    This is the third comment linking to that StackOverflow thread.

    Why does this one get featured instead of the first comment?

  • OccupyWallStreet (unregistered) in reply to Joe
    Joe:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    This is clearly a trick. The giveaway is criterion #2, because we all know that there is no important content on any Geocities site from 2000.

    That, and since Geocities is long gone (Yahoo closed them in what, 2010?), even the cat content filter will pass #2.

    OTOH, if it was Myspace... which then again has no useful content either...

  • Garrison Fiord (unregistered) in reply to Zylon
    Zylon:
    qazwsx:
    Not every long sentence is a run-on.
    Don't feed the very obvious troll.
    Didn't your mommy ever tell you about the little boy who cried "troll?"
  • (cs) in reply to Garrison Fiord

    @Garrison Fnord:

    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure-vanilla, consistently-designed pages, it can be a big mess.
    The hyphens are the only tweak that I might apply. The sentence does use valid grammar (not "grammer", you twit).

    Don't you idiots understand that Mark is intentionally Bowytzing my posts? I speak good grammar all the time.
    FTFY, partly; one does not speak grammar, one uses it. Except in your obvious-trollish case, of course.
    strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself
    Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")
  • (cs)

    I honestly LOVE all of the responses to my post.

  • (cs) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    Considering the requirements, I can do that easily. Here it is in Java:

    public static String filterContent(String source) {
        return "";
    }
    

    Because since when have those sites had any important content? :P

  • Obviously Fake Garrison Fiord (unregistered) in reply to Silverhill

    [quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]

    Mission Accomplished

  • Grammar Nazi (unregistered) in reply to iWantToKeepAnon

    I think you meant "Julienne"

  • entheh (unregistered)

    Ah, finally, a WTF worthy of my eagle eye. While you're all too busy moaning about regex, do you notice at the end, how many time they need to do way instain loop> which kill thier performance? Too notice this, I dare pary I am truely the frits.

  • (cs) in reply to RRDY
    RRDY:
    Y_F:
    TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

    That's the first thing that caught my eye. I wonder how many years it's been since that URL actually worked?

    It appears to be on http://www.webmonkey.com/2010/02/special_characters/ now. Still doesn't list the euro.

  • This might be Garrison Fiord (unregistered) in reply to Obviously Fake Garrison Fiord

    [quote user="Obviously Fake Garrison Fiord"][quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]

    Mission Accomplished[/quote]How do we know that "Obviously Fake Garrison Fiord" is not actually the very real "Garrison Fiord"?

  • Pinky Fjord (unregistered) in reply to This might be Garrison Fiord
    This might be Garrison Fiord:
    How do we know that "Obviously Fake Garrison Fiord" is not actually the very real "Garrison Fiord"?
    Especially given the consistency in bad quoting....
  • Cookyt (unregistered) in reply to chubertdev

    Wouldn't any such filter simply return the empty string?

  • (cs) in reply to Hypnos
    Hypnos:
    Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

    Outrageously bad coding style. In order to alert the user to the fact that he/she is about to read a return statement, it ought to be:

        /*
         * This is where the function / subroutine / method / delete as applicable
         * is expected to terminate. This is because it has completed
         * the task assigned for it, and can pass control back to the 
         * function / subroutine / method / program (delete as applicable)
         * which called it. Please do not omit the task of completing the
         * documentation. Thank you for your attention to detail.
         */
    

    Nothing less is even remotely acceptable.

  • Helluin (unregistered) in reply to chubertdev

    Just ran a search for "How to Extract Text from HTML". It's already 3rd on Google.

  • Captcha:cogo (unregistered) in reply to Ralph
    Ralph:
    What we need is a function that turns this... <html><script ... bloatware ... adware ... branded header ... sidebars ... fancy "navigation" crap ... disclaimers ...>

    Tiny bit of useful content.

    <... more crap ...>

    into...

    Tiny bit of useful content.

    Something like cleanPages?

    Since I assume you don't use Opera, there is also this bookmarklet.

    Ghostery is also fine if you just want to filter Facebook buttons and other crap.

  • Peter Lawrey (unregistered)

    I like the loop

            for (int index = 0; index < result.Length; index++)
            {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }
    

    If the text left is 1 MB, it will loop a million times, trying to replace ever-longer sequences of breaks and tabs (when they should be getting shorter with each iteration).

    For the last few iterations, it will be trying to replace a string of breaks and a string of tabs that are longer than the original string!
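    If the goal was simply to collapse long runs of breaks and tabs, one pass per character class would do. A minimal sketch in C# (assuming, from the replacement strings, that the targets are two carriage returns and four tabs; CollapseRuns is a hypothetical name, not the article's code):

        using System.Text.RegularExpressions;

        // Collapse any run of 3+ carriage returns to exactly 2, and any run
        // of 5+ tabs to exactly 4 -- one regex pass each, no million-iteration loop.
        static string CollapseRuns(string result)
        {
            result = Regex.Replace(result, "\r{3,}", "\r\r");
            result = Regex.Replace(result, "\t{5,}", "\t\t\t\t");
            return result;
        }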

  • cousteau (unregistered)

    I heard Chuck Norris is able to parse HTML using regular expressions.

  • Douglas Henke (unregistered) in reply to chubertdev

    Oh, that's easy. A module that produces empty output passes.

    Proof: For any yahoo, geocities or myspace page, the set "important content" is empty.

    QED

  • (cs) in reply to chubertdev

    Ummm, could you repeat "...the following criteria"? I saw process steps (how to achieve a criterion), but no criteria! ;-) Have a great weekend all!

  • hmm (unregistered) in reply to QJo
    QJo:
    Hypnos:
    Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

    Outrageously bad coding style. In order to alert the user to the fact that he/she is about to read a return statement, it ought to be:

        /*
         * This is where the function / subroutine / method / delete as applicable
         * is expected to terminate. This is because it has completed
         * the task assigned for it, and can pass control back to the 
         * function / subroutine / method / program (delete as applicable)
         * which called it. Please do not omit the task of completing the
         * documentation. Thank you for your attention to detail.
         */
    

    Nothing less is even remotely acceptable.

    What about // this is where i gave up

  • D. (unregistered)

    Near the top:

        // Replace line breaks with space
        // because browsers inserts space
        result = result.Replace("\n", " ");

    In the middle:

        // make line breaking consistent
        result = result.Replace("\n", "\r");

    Just in case the first replace didn't replace them all :)

  • Andrew Beard (unregistered) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    I can't imagine any of the three sites you listed have any important content. It seems like echo "\n" fulfills all those requirements.

  • Graham (unregistered) in reply to szeryf

    Back when I worked in 6809 asm (old videogame code) I came across a lengthy code block, pages and pages long, with just one comment at the end:

    die ; we are outta here!!

    "die" was a macro that expanded to "jsr sys_proc_exit" in out heavily multi-threaded (and very efficient) system.

  • Lars-Erik (unregistered)

    The using statement is for pussies! wtg!

  • Len (unregistered)

    Regex sucks for HTML, but .NET can compile regexes into DLLs and make some operations FLY!
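    For what it's worth, the in-process flavor of that is RegexOptions.Compiled, which compiles the pattern to MSIL when the Regex is constructed; Regex.CompileToAssembly is the .NET Framework API that writes the compiled pattern out as a DLL. A minimal sketch of the former (the tag-stripping pattern is just an illustration, not the article's code):

        using System.Text.RegularExpressions;

        // Compiled once, reused many times -- the usual first step before
        // reaching for Regex.CompileToAssembly and a pre-built assembly.
        static readonly Regex TagStripper = new Regex("<[^>]*>", RegexOptions.Compiled);

        static string StripTags(string html) => TagStripper.Replace(html, " ");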

  • tester (unregistered)

    I really don't see anything wrong with that code.

    It's the classic 'remove everything that you don't want and what is left is what you want' approach.

  • Gobzo (unregistered) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages: ... 3) five random MySpace pages

    To meet the following criteria: ... 2) preserve the important content ...

    return null.

  • Joe (unregistered) in reply to Geoff

    This is correct, Geoff. (I posted the code btw)

    To "fix" it, I initially integrated HtmlAgility I believe or some similar lightweight library for HTML stripping like many suggested on this thread (which of course made it much faster). But then, as you mention, I realized that the appearance of the "extracted text" was not the same, which turned out to be pretty important since it was viewed by a viewer in the application as the "extracted text" version of an HTML formatted email.

    Either way, I sped that method up dramatically (that loop at the bottom is ridiculous), but I wanted to point out that (like most things) it wasn't as black/white as it seemed at first...
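    For anyone who wants the library route Joe describes, a minimal HtmlAgilityPack sketch (my own illustration, not Joe's code; note that InnerText makes no attempt to mimic browser whitespace, which is exactly the appearance problem he ran into):

        using HtmlAgilityPack;

        static string ExtractText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Remove script and style elements so their bodies don't leak
            // into the extracted text. SelectNodes returns null on no match.
            var junk = doc.DocumentNode.SelectNodes("//script|//style");
            if (junk != null)
                foreach (var node in junk)
                    node.Remove();

            // Decode &amp;, &euro;, etc. instead of regex-stripping entities.
            return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        }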

  • Lucas H (unregistered)

    When all you have is a hammer, now you have two problems.

  • Grayson C (unregistered) in reply to chubertdev

    If you're processing those pages with those criteria, the easiest way to meet the spec would be to return the empty string.

  • Ed (unregistered) in reply to Coward

    This is, in fact, the first page returned by a Google search for "extract html text". I was all set to copy and paste the function before I read it a bit more closely.

  • Andy Haritovich (unregistered)

    Or just use one of the excellent API services for this. Diffbot makes this dead simple:

    http://diffbot.com/api/article?token=...&url=...

    and you get back a JSON object with your cleaned text, title, images, dates and whatever else you could want.
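    A minimal sketch of calling that endpoint from C# (YOUR_TOKEN and the article URL are placeholders; Diffbot issues real tokens per account):

        using System;
        using System.Net.Http;
        using System.Threading.Tasks;

        static async Task Main()
        {
            var endpoint = "http://diffbot.com/api/article"
                         + "?token=YOUR_TOKEN"
                         + "&url=" + Uri.EscapeDataString("http://example.com/some-article");

            using var client = new HttpClient();

            // The response is a JSON object carrying the cleaned text,
            // title, images, dates, and so on.
            Console.WriteLine(await client.GetStringAsync(endpoint));
        }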
