Admin
It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!
s/<.*>//g
Admin
Are these the solutions you find on the Internet?
Admin
Great, now someone will copy and paste that into their app. If you are a Python user, Scrapy is pretty awesome. http://scrapy.org/
Admin
Now, I may not be a professional developer, but wouldn't for instance php's strip_tags function do this?
Then again, I'm not getting paid per line of code, so what do I know?
Admin
Starting today, all my return statements will have the comment // That's it.
Admin
What we need is a function that turns this...
<html><script ... bloatware ... adware ... branded header ... sidebars ... fancy "navigation" crap ... disclaimers ...>Tiny bit of useful content.
<... more crap ...>into...
Tiny bit of useful content.
Admin
Ha ha ha, brilliant!
Admin
Just spotted this, but seriously... is this the best way to do this?
Admin
Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to butcher my posts like that drunken narcissist Remy?
Admin
I wrote one of these and it worked pretty well, and I didn't use RegEx. But I did it as a series of classes. I am not sure why someone decides to do it as a single function unless they fail to realize the scope. There are issues of whitespace, tabs, number of lines between p or br tags, bulleted and numbered lists, horizontal lines, and all of the funky escaped and encoded characters. OK, not really sure what my point is.
Admin
Captcha: dolor - because my head hurts.
Admin
Not every long sentence is a run-on.
Admin
Ah, that would clobber the entire document, since * is greedy!
s/<.*?>//g or s/<[^>]*>//g
Better. Assuming valid (ish) markup. And then just replace HTML entities with their characters.
It would show the inline Javascript and style sheets. It would fail for CDATA sections containing > characters. Embedded newlines will be preserved whereas <br> and <p> and table codes will be ignored. So you may want to do something about those too...
It's not a trivial process to turn HTML into POT!
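For anyone who wants to see the difference rather than take it on faith, here is a rough Python sketch of the three patterns discussed above. The sample strings and function names are made up for illustration, and none of this is meant as a real HTML-to-text converter: it only shows the greedy match eating everything, the two safer patterns, entity replacement via the standard `html.unescape`, and the inline script leaking through.

```python
import html
import re

# Hypothetical sample markup, invented for this illustration.
SAMPLE = "<p>Fish &amp; chips</p><p>Tea</p>"
WITH_SCRIPT = "<script>var a = 1;</script>Hi"


def strip_greedy(markup):
    # Greedy: .* runs from the first '<' to the LAST '>' on the line.
    return re.sub(r"<.*>", "", markup)


def strip_lazy(markup):
    # Non-greedy: .*? stops at the first '>' it can.
    return re.sub(r"<.*?>", "", markup)


def strip_negated(markup):
    # Negated class: anything except '>' between the brackets.
    return re.sub(r"<[^>]*>", "", markup)


def to_text(markup):
    # Strip tags, then turn entities such as &amp; back into characters.
    return html.unescape(strip_negated(markup))


if __name__ == "__main__":
    print(repr(strip_greedy(SAMPLE)))   # the whole line is gone
    print(repr(strip_lazy(SAMPLE)))
    print(repr(to_text(SAMPLE)))
    print(repr(to_text(WITH_SCRIPT)))   # script body leaks into the text
```

Note that the script body survives tag stripping and the paragraph boundary between "chips" and "Tea" vanishes, which is exactly the kind of thing listed above.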
Admin
Given the popularity of this site, I can't help but wonder how soon this will reach the top of search results and some ignorant idiot thinks that this is a good way to do it...
Admin
That seems the most sensible thing to do, all things considered. Have you seen that internet lately?
Admin
TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/
Admin
Perfect! I always knew it would come in handy one day
Admin
"You're not" or "You aren't". Do you want to start a grammar fight with a sentence that has a syntax error?
Admin
Not that what the author of the article posted is the right way to go about it, but you can clearly infer from what they wrote that there was intent to preserve some of the HTML document's formatting as plain text. It replaces <br> with \n for example. Your suggestion would lose all of the formatting.
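To make the "keep some formatting" idea concrete, here is a minimal sketch using Python's standard `html.parser` instead of regexes. The class name, the helper `html_to_text`, and the choice of which tags count as line breaks are my own assumptions, not anything from the article's code. It turns a handful of block-ish tags into newlines and drops script and style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, emitting a newline for a few block-level tags."""

    # Assumed set of tags that should start a new line in the output.
    BLOCK_TAGS = {"br", "p", "div", "tr", "li"}
    # Tags whose contents should not appear in the text at all.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is dropped.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    sample = "<p>one<br>two</p><script>var x = 1;</script><p>three &amp; four</p>"
    print(html_to_text(sample))
```

It still ignores lists, tables, and whitespace normalisation, so treat it as a starting point only.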
Admin
Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"
Admin
whereas your expert developer will probably google first before attempting to tackle the issue themselves.
If I did have to write my own implementation for HTML pages: if they are XHTML compliant then they are well-formed XML and could be parsed as such, but plain HTML doesn't enforce that and as such will often contain breaks like <br> as unclosed tags, so it won't work with an XML parser.
Regex is not good for parsing recursive (nested) expressions.
Filed under reinventing your own square wheel anti-pattern
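A quick way to check the XML-parser point, sketched in Python with the standard `xml.etree.ElementTree` module. The two sample strings and the helper name `parses_as_xml` are invented for illustration. A self-closed break parses fine, while the unclosed HTML-style break makes the parser give up.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the same paragraph, XHTML style and HTML style.
XHTML_STYLE = "<p>one<br/>two</p>"
HTML_STYLE = "<p>one<br>two</p>"


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def xml_text(markup):
    # Concatenate all text nodes of a well-formed document.
    return "".join(ET.fromstring(markup).itertext())


if __name__ == "__main__":
    print(parses_as_xml(XHTML_STYLE))  # self-closed <br/> is fine
    print(parses_as_xml(HTML_STYLE))   # unclosed <br> is a mismatched tag
    print(xml_text(XHTML_STYLE))
```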
Admin
It will never beat the related Stack Overflow question on the subject. The accepted answer is hilarious. Read and enjoy: RegEx match open tags except XHTML self-contained tags
The fun thing is that sometimes regexes aren't that bad. I think the Readability project actually used regexes coupled with some DOM parsing to actually solve exactly the problem this developer had - extracting the real article text out of websites.
Admin
Using regular expressions to parse HTML? I vaguely remember having read about this before... But only very vaguely, as from a distant nightmare!
Addendum (2012-09-12 15:37): Damn, beaten by Emil Vikström because stupid Akismet forced me to add a sentence to my post before accepting it is not spam!
Akismet is the real WTF.
Admin
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
--jwz
Admin
Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Admin
I also haven't registered my name so I can't edit my posts and other strangers can just use it and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself
Admin
Kinda looks like C#... How about a WebBrowser object (snippet):
string result = string.Empty;
webBrowser1.DocumentText = source;
webBrowser1.DocumentCompleted += (sender, e) =>
{
    HtmlDocument doc = webBrowser1.Document;
    result = doc.GetElementsByTagName("html")[0].InnerText;
};
Addendum (2012-09-12 10:41): (Yea, there are probably threading issues with this... just trying to illustrate the technique...)
Admin
That's the first thing that caught my eye. I wonder how many years it's been since that URL actually worked?
Captcha: ACSI: The earliest SCSI prototype: about 18 versions prior to the final release.
Admin
Obligatory regex + HTML = insanity: http://stackoverflow.com/a/1732454/213197
Admin
Dang, beat me to it.
Do I still get a silver medal for the "you have a problem, you use regular expressions, and now you have two problems" quote?
Addendum (2012-09-12 11:19): EDIT:
Dang again, beaten to that one as well.
Admin
The TRWTF here is the reality of trying to parse a very forgiving, interpreted language using a strict set of rules.
http://www.codinghorror.com/blog/2007/04/javascript-and-html-forgiveness-by-default.html
Admin
My "solved several times over effectively" stack is saxon + tagsoup. Works like a charm. Tagsoup parses real world crap html and saxon xslt lets me slice and dice and make julienne fries.
Admin
saxon is Java though, so that would make me sad. Slightly lighter is python, lxml and html5lib.
Admin
FTFY
Admin
[quote user="redtetrahedron"]
string result = string.Empty;
webBrowser1.DocumentText = source;
webBrowser1.DocumentCompleted += (sender, e) =>
{
    HtmlDocument doc = webBrowser1.Document;
    // That's it.
    result = doc.GetElementsByTagName("html")[0].InnerText;
};
[/quote]
FTFY, now it's consistent with today's topic.
Admin
Is this what's called a regdep? Or, Regular Depression?
At least this is comforting. It lets me know that I can drink and dope myself into oblivion. All I have to do to get a job in IT is to semi-sober up for a couple of hours to go to an interview.
Admin
I thought the WTF was the line about processing (only) 700k files. Does this refer to how many were done or that it could only handle files of that size?
And we don't know the state of the original HTML, when the code was written, or whether this is code which has been changed because of changes to the HTML.
Admin
To be certified as a good content filter, any module should have to pass this set of tests:
Strip these pages:
To meet the following criteria:
I call it the Flying Spaghetti Code test.