The Daily WTF: Curious Perversions in Information Technology

2012-09-12 Reply Admin

It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!

s/<.*>//g

2012-09-12 Reply Admin

Are these the solutions you find on the Internet?

2012-09-12 Reply Admin

Great, now someone will copy and paste that into their app. If you are a Python user, Scrapy is pretty awesome. http://scrapy.org/

2012-09-12 Reply Admin

Now, I may not be a professional developer, but wouldn't for instance php's strip_tags function do this?

Then again, I'm not getting paid per line of code, so what do I know?

2012-09-12 Reply Admin

Starting today, all my return statements will have the comment // That's it.

2012-09-12 Reply Admin

What we need is a function that turns this...

Tiny bit of useful content.

<... more crap ...>

into...

Tiny bit of useful content.

2012-09-12 Reply Admin

szeryf:
Starting today, all my return statements will have the comment // That's it.

Ha ha ha, brilliant!

dignissim - an attempt at simulating dignity

GettinSadda · 2012-09-12 Reply Admin

Just spotted this, but seriously... is this the best way to do this?

public static string StripTagsCharArray(string source) {
    try
    {
        string result;
...

2012-09-12 Reply Admin

Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

2012-09-12 Reply Admin

I wrote one of these and it worked pretty well, and I didn't use RegEx. But I did it as a series of classes. I am not sure why someone decides to do it as a single function unless they fail to realize the scope. There are issues of whitespace, tabs, number of lines between p or br tags, bulleted and numbered lists, horizontal lines, and all of the funky escaped and encoded characters. OK, not really sure what my point is.

2012-09-12 Reply Admin

Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Stop to criticize gammer error. Ain't being worth any being's time.

2012-09-12 Reply Admin

Captcha: dolor - because my head hurts.

ASheridan · 2012-09-12 Reply Admin

Garrison Fiord:
or are you just going to buntcher my posts like that drunken narcissist Remy?

*butcher

2012-09-12 Reply Admin

Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.

2012-09-12 Reply Admin

qazwsx:
Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.

That one certainly is, though. Great job, Mark, on showing your not the bigger man.

Zemm · 2012-09-12 Reply Admin

Sheesh:
It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!
s/<.*>//g

Ah, that would clobber the entire document, since * is greedy!

s/<.*?>//g or s/<[^>]*>//g

Better. Assuming valid (ish) markup. And then just replace HTML entities with their characters.

It would show the inline Javascript and style sheets. It would fail for CDATA sections containing > characters. Embedded newlines will be preserved whereas <p> and <br> and table codes will be ignored. So you may want to do something about those too...

It's not a trivial process to turn HTML into POT!
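
For anyone who wants to see the greedy problem for themselves, here is a rough Python sketch. The sample strings and variable names are made up for illustration, and this is only a demonstration of the regex behaviour discussed above (greedy `<.*>` versus non-greedy `<.*?>` and the negated class `<[^>]*>`), not a recommended way to strip markup.

```python
import html
import re

# Hypothetical sample markup, invented for this demonstration.
sample = "<p>Tom &amp; Jerry are <b>friends</b>.</p>"

# Greedy: matches from the first "<" to the last ">", so everything goes.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy and negated-class variants stop at the first ">" after each "<".
lazy = re.sub(r"<.*?>", "", sample)
negated = re.sub(r"<[^>]*>", "", sample)

# Entities still have to be decoded as a separate step.
text = html.unescape(negated)

# Inline script bodies are not tags, so they survive the substitution.
leaky = re.sub(r"<[^>]*>", "", "<script>var x = 1;</script>Hi")

print(repr(greedy))  # ''
print(lazy)          # Tom &amp; Jerry are friends.
print(text)          # Tom & Jerry are friends.
print(leaky)         # var x = 1;Hi
```

The last line shows the inline Javascript leaking through, which is exactly the caveat mentioned above.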

2012-09-12 Reply Admin

Given the popularity of this site, I can't help but wonder how soon this will reach the top of search results and some ignorant idiot thinks that this is a good way to do it...

2012-09-12 Reply Admin

Zemm:
Sheesh:
It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!
s/<.*>//g

Ah, that would clobber the entire document, since * is greedy!

I assume that was the point (pun intended).

2012-09-12 Reply Admin

Zemm:
It's not a trivial process to turn HTML into POT!

(Spoken in the sarcastic voice of one of the most clued-in managers I ever had the pleasure of working for...) I don't understand it, so it must be easy!

2012-09-12 Reply Admin

Zemm:
Sheesh:
It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!
s/<.*>//g

Ah, that would clobber the entire document, since * is greedy!

That seems the most sensible thing to do, all things considered. Have you seen that internet lately?

Y_F · 2012-09-12 Reply Admin

TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

2012-09-12 Reply Admin

llandor:
Now, I may not be a professional developer, but wouldn't for instance php's strip_tags function do this?

The ideal solution must surely be PHP's wonderful fgetss() function, which is "Identical to fgets(), except that fgetss() attempts to strip any NUL bytes, HTML and PHP tags from the text it reads."

Perfect! I always knew it would come in handy one day

2012-09-12 Reply Admin

Garrison Fiord:
qazwsx:
Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.
That one certainly is, though. Great job, Mark, on showing your not the bigger man.

"You're not" or "You aren't". Do you want to start a grammar fight by a sentence that has a syntax error?

2012-09-12 Reply Admin

Not that what the author of the article posted is the right way to go about it, but you can clearly infer from what they wrote that there was intent to preserve some of the HTML document's formatting as plain text. It replaces <br> with \n, for example. Your suggestion would lose all of the formatting.

D-Coder · 2012-09-12 Reply Admin

Garrison Fiord:
qazwsx:
Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.
That one certainly is, though. Great job, Mark, on showing your not the bigger man.

Okay, now we know you're doing it on purpose.

2012-09-12 Reply Admin

Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-.

Error. Sentence fragments.

2012-09-12 Reply Admin

Honnza:
"You're not" or "You aren't". Do you want to start a grammar fight by a sentence that has a syntax error?

Do you want to start fighting a troll by getting trolled?

2012-09-12 Reply Admin

Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

Cbuttius · 2012-09-12 Reply Admin

Your average developer, when faced with this situation, would do what any sane person would do – first try to tackle the problem on their own and after a few frustrating iterations, eventually turn to the Internet to solve their problem. Thankfully, as it turns out, the problem of parsing text out of web code has been solved several times over effectively turning your development task into an integration task. Hooray!

whereas your expert developer will probably google first before attempting to tackle the issue themselves.

If I did have to write my own implementation for HTML pages, if they are HTML1.1 compliant then they are valid XML and could be parsed as such, but HTML1.0 doesn't enforce that and as such will often contain breaks like <br> as unclosed tags so won't work with an XML parser.

Regex is not good for parsing recursive (nested) expressions.

Filed under reinventing your own square wheel anti-pattern
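
As a rough illustration of the parser route, here is a minimal Python sketch built on the standard library's `html.parser`, which tolerates unclosed tags such as `<br>`. The names `TextExtractor` and `extract_text` and the choice of which tags count as line breaks are assumptions made for this example; it is not the article's code and it makes no attempt to pick out the "real" content of a page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, skip script/style bodies, turn block tags into newlines."""

    SKIP = {"script", "style"}                 # contents are dropped entirely
    BREAKS = {"br", "p", "div", "li", "tr"}    # assumed line-breaking tags

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(extract_text("<p>One<br>Two &amp; three<script>var a = 1 > 0;</script></p><ul><li>Item</ul>"))
```

The unclosed `<br>` and `<li>` do not trip it up, and the script body is dropped rather than leaking into the output. Lists, tables and non-breaking spaces would all need more care than this.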

2012-09-12 Reply Admin

It will never beat the related Stack Overflow question on the subject. The accepted answer is hilarious. Read and enjoy: RegEx match open tags except XHTML self-contained tags

The fun thing is that sometimes regexes aren't that bad. I think the Readability project actually used regexes coupled with some DOM parsing to actually solve exactly the problem this developer had - extracting the real article text out of websites.

no laughing matter · 2012-09-12 Reply Admin

Using regular expressions to parse HTML? I vaguely remember having read about this before... But only very vaguely, as from a distant nightmare!

Addendum (2012-09-12 15:37): Damn, beaten by Emil Vikström because stupid Akismet forced me to add a sentence to my post before accepting it is not spam!

Akismet is the real WTF.

2012-09-12 Reply Admin

D-Coder:
Garrison Fiord:
qazwsx:
Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.
That one certainly is, though. Great job, Mark, on showing your not the bigger man.
Okay, now we know you're doing it on purpose.

Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

2012-09-12 Reply Admin

Emil Vikström:
It will never beat the related Stack Overflow question on the subject. The accepted answer is hilarious. Read and enjoy: RegEx match open tags except XHTML self-contained tags

Truth, brother, truth! Yea verily the apocalypse is upon us, even at the door.

2012-09-12 Reply Admin

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

--jwz

2012-09-12 Reply Admin

Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

2012-09-12 Reply Admin

Garrison Fiord:
D-Coder:
Garrison Fiord:
qazwsx:
Garrison Fiord:
Mark Bowytz:
With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

Not every long sentence is a run-on.
That one certainly is, though. Great job, Mark, on showing your not the bigger man.
Okay, now we know you're doing it on purpose.
Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

I also haven't registered my name so I can't edit my posts and other strangers can just use it and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself

redtetrahedron · 2012-09-12 Reply Admin

Kinda looks like C#... How about a WebBrowser object (snippet):

string result = string.Empty;
webBrowser1.DocumentText = source;
webBrowser1.DocumentCompleted += (sender, e) =>
    {
        HtmlDocument doc = webBrowser1.Document;
        result = doc.GetElementsByTagName("html")[0].InnerText;
    };

Addendum (2012-09-12 10:41): (Yea, there are probably threading issues with this... just trying to illustrate the technique...)

2012-09-12 Reply Admin

Y_F:
TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

That's the first thing that caught my eye. I wonder how many years it's been since that URL actually worked?

Captcha: ACSI: The earliest SCSI prototype: about 18 versions prior to the final release.

2012-09-12 Reply Admin

Obligatory regex + HTML = insanity: http://stackoverflow.com/a/1732454/213197

chubertdev · 2012-09-12 Reply Admin

Tom Mathias:
Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Dang, beat me to it.

Do I still get a silver medal for the "you have a problem, you use regular expressions, and now you have two problems" quote?

Addendum (2012-09-12 11:19): EDIT:

Dang again, beaten to that one as well.

chubertdev · 2012-09-12 Reply Admin

The TRWTF here is the reality of trying to parse a very forgiving, interpreted language using a strict set of rules.

http://www.codinghorror.com/blog/2007/04/javascript-and-html-forgiveness-by-default.html

2012-09-12 Reply Admin

My "solved several times over effectively" stack is saxon + tagsoup. Works like a charm. Tagsoup parses real-world crap HTML, and saxon XSLT lets me slice and dice and make julienne fries.

2012-09-12 Reply Admin

iWantToKeepAnon:
My "solved several times over effectively" stack is saxon + tagsoup. Works like a charm. Tagsoup parses real-world crap HTML, and saxon XSLT lets me slice and dice and make julienne fries.

saxon is Java though, so that would make me sad. Slightly lighter is python, lxml and html5lib.

Zylon · 2012-09-12 Reply Admin

qazwsx:
Not every long sentence is a run-on.

Don't feed the very obvious troll.

2012-09-12 Reply Admin

Emil Vikström:
It will never beat the related Stack Overflow question on the subject. Read and enjoy: RegEx match open tags except XHTML self-contained tags

That cad, Sam, got over 10% of his reputation from his answer to that question...

Hmmmm · 2012-09-12 Reply Admin

Jamie Zawinski was right:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have twenty-eight problems.

FTFY

2012-09-12 Reply Admin

redtetrahedron:
string result = string.Empty;
webBrowser1.DocumentText = source;
webBrowser1.DocumentCompleted += (sender, e) =>
    {
        HtmlDocument doc = webBrowser1.Document;
        // That's it.
        result = doc.GetElementsByTagName("html")[0].InnerText;
    };

FTFY, now it's consistent with today's topic.

2012-09-12 Reply Admin

Is this what's called a regdep? Or, Regular Depression?

At least this is comforting. It lets me know that I can drink and dope myself into oblivion. All I have to do to get a job in IT is to semi-sober up for a couple of hours to go to an interview.

2012-09-12 Reply Admin

I thought the WTF was the line about processing (only) 700k files. Does this refer to how many were done or that it could only handle files of that size?

And we don't know the state of the original HTML, when the code was written, or whether this is code which has been changed because of changes to the HTML.

chubertdev · 2012-09-12 Reply Admin

To be certified as a good content filter, any module should have to pass this set of tests:

Strip these pages:

yahoo.com
five random Geocities sites from 2000
five random MySpace pages

To meet the following criteria:

remove all ads and extraneous material
preserve the important content
the only formatting should be semantic formatting on the important content

I call it the Flying Spaghetti Code test.

How to Extract Text from HTML (Experts Only)
