Admin
I tried extracting text from HTML with regular expressions once.
Once.
For about five minutes.
Admin
Man, these problems are like a wet dream to me. I mean, they have it all: poorly formatted XML and RegEx. When I told my parents with a smile on my face that I would take CS in college, this was exactly what I was thinking about... and PHP, which for some reason didn't feel like a good idea later on.
Admin
lynx -dump http://www.w3.org/
Admin
Well, my first thought on this is: He really, Really, REALLY likes lots of objects; a whole flood of objects; in fact, a deluge of objects; and MONSTROUS objects, too! In fact, he is the King King of garbage collection testers.
My second thought is: He really can't keep a thought in mind from one part of a program to another, can he? Because otherwise, he wouldn't have put in all that object-creating code to regularize the line breaks and tabs he removed at the top.
My third thought is: He really doesn't like ampersands, does he? After all, he just killed them all...including "&amp;".
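For what it's worth, the usual alternative to deleting ampersands is to decode the entities they introduce. A minimal sketch in Python (not the language of the original code; the function name `decode_entities` is made up for illustration), leaning on the standard library's `html.unescape`:

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references back into the characters they stand for."""
    # html.unescape handles named references (&amp;, &lt;, &euro;)
    # as well as numeric ones (&#38;), instead of dropping the "&".
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("Fish &amp; chips &lt;3"))
```

This only covers entity decoding; stripping the tags themselves is a separate job.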
Admin
The sites given make this far too easy.
Admin
I had a Geocities page in 2000. Now I has a sad.
(Okay, I admit, it was an MST3K fanfic site.)
Admin
This is clearly a trick. The giveaway is criterion #2, because we all know that there is no important content on any Geocities site from 2000.
Admin
Why does this one get featured instead of the first comment?
Admin
That, and since Geocities is long gone (Yahoo closed them in what, 2010?), even the cat content filter will pass #2.
OTOH, if it was Myspace... which then again has no useful content either...
Admin
@Garrison Fnord:
The hyphens are the only tweak that I might apply. The sentence does use valid grammar (not "grammer", you twit). FTFY, partly; one does not speak grammar, one uses it. Except in your obvious-trollish case, of course. Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")
Admin
I honestly LOVE all of the responses to my post.
Admin
Considering the requirements, I can do that easily. Here it is in Java:
Because since when have those sites had any important content? :P
Admin
[quote][quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]
Mission Accomplished
Admin
I think you meant "Julienne"
Admin
Ah, finally, a WTF worthy of my eagle eye. While you're all too busy moaning about regex, do you notice at the end, how many time they need to do way instain loop> which kill thier performance? Too notice this, I dare pary I am truely the frits.
Admin
It appears to be on http://www.webmonkey.com/2010/02/special_characters/ now. Still doesn't list euro.
Admin
[quote user="Obviously Fake Garrison Fiord"][quote][quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]
Mission Accomplished[/quote]How do we know that "Obviously Fake Garrison Fiord" is not actually the very real "Garrison Fiord"?
Admin
Wouldn't any such filter simply return the empty string?
Admin
Outrageously bad coding style. In order to alert the user to the fact that he/she is about to read a return statement, it ought to be:
Nothing less is even remotely acceptable.
Admin
Just ran a search for "How to Extract Text from HTML". It's already 3rd on google.
Admin
Since I assume you don't use Opera, there is also this bookmarklet.
Ghostery is also fine if you just want to filter Facebook buttons and other crap.
Admin
I like the loop
If the text left is 1 MB, it will loop 1 million times trying to replace ever longer sequences of breaks and tabs (when they should be getting shorter with each iteration).
For the last few loops, it will be trying to replace a string of breaks and a string of tabs which is longer than the original string!
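A rough sketch of what that loop was presumably after, in Python rather than the original's language. It assumes the intent was to squeeze every run of breaks or tabs down to a single character, which may not be exactly what the original aimed for; the name `collapse_runs` is made up for illustration.

```python
import re


def collapse_runs(text: str) -> str:
    """Collapse each run of "\r" to one "\r" and each run of tabs to one tab."""
    # Each substitution is a single pass over the text, so the work grows
    # with the input length once, not once per character of the input.
    text = re.sub(r"\r+", "\r", text)
    text = re.sub(r"\t+", "\t", text)
    return text


if __name__ == "__main__":
    print(repr(collapse_runs("a\r\r\rb\t\t\t\tc")))
```

Yes, it is a regex, but collapsing runs of one character is about as far from parsing HTML as a regex gets.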
Admin
I heard Chuck Norris is able to parse HTML using regular expressions.
Admin
Oh, that's easy. A module that produces empty output passes.
Proof: For any yahoo, geocities or myspace page, the set "important content" is empty.
QED
Admin
Ummm, could you repeat "...the following criteria"? I saw process steps (how to achieve a criterion), but no criteria! ;-) Have a great weekend all!
Admin
What about // this is where i gave up
Admin
Near the top:
    // Replace line breaks with space
    // because browsers inserts space
    result = result.Replace("\n", " ");
In the middle:
    // make line breaking consistent
    result = result.Replace("\n", "\r");
Just in case the first replace didn't replace them all :)
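To make the point concrete, here is a small Python sketch that mirrors only the two quoted calls and counts what the second one has left to work on. It ignores whatever the original does between them (which could reintroduce line breaks), and the function name `strip_then_normalize` is made up.

```python
def strip_then_normalize(text: str) -> tuple:
    """Apply both quoted replaces in order; also report how many "\n" the second one sees."""
    # Near the top: every "\n" becomes a space.
    result = text.replace("\n", " ")
    # Count the line feeds still present before the second replace runs.
    remaining = result.count("\n")
    # In the middle: "\n" -> "\r", which has nothing to match if remaining is 0.
    result = result.replace("\n", "\r")
    return result, remaining


if __name__ == "__main__":
    print(strip_then_normalize("one\ntwo\nthree"))
```

With nothing in between, the second replace never finds a match, so it only costs a scan of the string.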
Admin
I can't imagine any of the three sites you listed have any important content. It seems like echo "\n" fulfills all those requirements.
Admin
Back when I worked in 6809 asm (old videogame code) I came across a lengthy code block, pages and pages long, with just one comment at the end:
die ; we are outta here!!
"die" was a macro that expanded to "jsr sys_proc_exit" in our heavily multi-threaded (and very efficient) system.
Admin
The using statement is for pussies! wtg!
Admin
Regex sucks for HTML, but .NET can compile regexes into DLLs and make some operations FLY!
Admin
I really don't see anything wrong with that code.
It's the classic 'remove everything that you don't want and what is left is what you want' approach.
Admin
return null.
Admin
This is correct, Geoff. (I posted the code btw)
To "fix" it, I initially integrated HtmlAgility, I believe, or some similar lightweight library for HTML stripping, as many suggested in this thread (which of course made it much faster). But then, as you mention, I realized that the appearance of the "extracted text" was not the same, which turned out to be pretty important, since it was shown in a viewer in the application as the "extracted text" version of an HTML-formatted email.
Either way, I sped that method up dramatically (that loop at the bottom is ridiculous), but I wanted to point out that (like most things) it wasn't as black/white as it seemed at first...
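For anyone curious what the parser-based route looks like, here is a minimal sketch in Python using the standard library's `html.parser`, not the .NET library mentioned above. The names `TextExtractor` and `extract_text` are invented for illustration. It skips script and style content, decodes entities, and emits a newline for `<br>` and `<p>`, but it makes no real attempt to reproduce layout, which is exactly the catch described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp; and friends for us.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in ("br", "p"):
            # Crude stand-in for layout: one newline per break or paragraph.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    print(extract_text("<p>Fish &amp; chips</p><script>var x = 1;</script><b>ok</b>"))
```

Matching the look of the old output (indentation, list bullets, table spacing) would need extra handling per tag on top of this.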
Admin
When all you have is a hammer; now you have two problems.
Admin
If you're processing those pages with those criteria, the easiest way to meet the spec would be to return the empty string.
Admin
This is, in fact, the first page returned by a Google search for "extract html text". I was all set to copy and paste the function before I read it a bit more closely.
Admin
Or just do it with one of the excellent API services for this. Diffbot makes this dead simple:
http://diffbot.com/api/article?token=...&url=...
and you get back a JSON object with your cleaned text, title, images, dates and whatever else you could want.