How to Extract Text from HTML (Experts Only)

2012-09-12

Starting today, all my return statements will have the comment // That's it.

2012-09-12

Given the popularity of this site, I can't help but wonder how soon this will reach the top of search results and some ignorant idiot will think that this is a good way to do it...

2012-09-12

Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

2012-09-12

My "solved several times over effectively" stack is Saxon + TagSoup. Works like a charm. TagSoup parses real-world crap HTML, and Saxon XSLT lets me slice and dice and make julienne fries.
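
For anyone who wants to see the "use a real parser" idea without setting up a Java stack, here is a minimal sketch. It is not TagSoup or Saxon: it swaps in Python's standard-library html.parser as a stand-in for a forgiving parser, and the names TextExtractor and extract_text are made up for illustration. It only collects visible text and skips script and style contents; it does no XSLT-style slicing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string (assumed already decoded)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b><script>var x = 1;</script><p>unclosed"
    print(extract_text(sample))
```

The parser tolerates the unclosed tag in the sample instead of choking on it, which is the whole point of not reaching for a regex. Anything beyond plain text (picking out specific elements, dropping ads) would need a real tree and a query language, which is where the XSLT half of the stack comes in.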

chubertdev · 2012-09-12

To be certified as a good content filter, any module should have to pass this set of tests:

Strip these pages:

yahoo.com
five random Geocities sites from 2000
five random MySpace pages

To meet the following criteria:

remove all ads and extraneous material
preserve the important content
the only formatting should be semantic formatting on the important content

I call it the Flying Spaghetti Code test.
