• Tom (unregistered)

    I tried extracting text from HTML with regular expressions once.

    Once.

    For about five minutes.

  • (cs) in reply to chubertdev

    Man, these problems are like a wet dream to me. I mean, they have it all: poorly formatted XML and RegEx. When I told my parents, with a smile on my face, that I would take CS in college, this was exactly what I was thinking about... and PHP, which for some reason didn't feel like a good idea later on.

  • nobody (unregistered)

    lynx -dump http://www.w3.org/

  • (cs)

    Well, my first thought on this is: He really, Really, REALLY likes lots of objects; a whole flood of objects; in fact, a deluge of objects; and MONSTROUS objects, too! In fact, he is the King Kong of garbage collection testers.

    My second thought is: He really can't keep a thought in mind from one part of a program to another, can he? Because otherwise, he wouldn't have put in all that object-creating code to regularize the line breaks and tabs he removed at the top.

    My third thought is: He really doesn't like ampersands, does he? After all, he just killed them all...including "&".

  • Anogynous (unregistered) in reply to chubertdev

    The sites given make this far too easy.

    public String filterHTML (String inputPage) {
        return null;
    }
    
  • (cs) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    When has there been important content on any of those sites?

  • Calli Arcale (unregistered) in reply to PedanticCurmudgeon
    PedanticCurmudgeon:
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    When has there been important content on any of those sites?

    I had a Geocities page in 2000. Now I has a sad.

    (Okay, I admit, it was an MST3K fanfic site.)

  • Garrison Fiord (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    Garrison Fiord:
    D-Coder:
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.
    Okay, now we know you're doing it on purpose.
    Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

    I also haven't registered my name so I can't edit my posts and other strangers can just use it and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself

    I probably wouldn't want the jerks who stoop to modify my comments to have my email address, thank you.

  • Joe (unregistered) in reply to chubertdev
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    This is clearly a trick. The giveaway is criterion #2, because we all know that there is no important content on any Geocities site from 2000.

  • (cs) in reply to Tom Mathias
    Tom Mathias:
    Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

    This is the third comment linking to that StackOverflow thread.

    Why does this one get featured instead of the first comment?

  • OccupyWallStreet (unregistered) in reply to Joe
    Joe:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    This is clearly a trick. The giveaway is criterion #2, because we all know that there is no important content on any Geocities site from 2000.

    That, and since Geocities is long gone (Yahoo closed them in what, 2010?), even the cat content filter will pass #2.

    OTOH, if it was Myspace... which then again has no useful content either...

  • Garrison Fiord (unregistered) in reply to Zylon
    Zylon:
    qazwsx:
    Not every long sentence is a run-on.
    Don't feed the very obvious troll.
    Didn't your mommy ever tell you about the little boy who cried "troll?"
  • (cs) in reply to Garrison Fiord

    @Garrison Fnord:

    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure-vanilla, consistently-designed pages, it can be a big mess.
    The hyphens are the only tweak that I might apply. The sentence does use valid grammar (not "grammer", you twit).

    Don't you idiots understand that Mark is intentionally Bowytzing my posts? I speak good grammar all the time.
    FTFY, partly; one does not speak grammar, one uses it. Except in your obvious-trollish case, of course.
    strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself
    Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")
  • (cs)

    I honestly LOVE all of the responses to my post.

  • (cs) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    Considering the requirements, I can do that easily. Here it is in Java:

    public static String filterContent(String source) {
        return "";
    }
    

    Because since when have those sites had any important content? :P

  • Obviously Fake Garrison Fiord (unregistered) in reply to Silverhill

    [quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]

    Mission Accomplished

  • Grammar Nazi (unregistered) in reply to iWantToKeepAnon

    I think you meant "Julienne"

  • entheh (unregistered)

    Ah, finally, a WTF worthy of my eagle eye. While you're all too busy moaning about regex, do you notice at the end, how many time they need to do way instain loop> which kill thier performance? Too notice this, I dare pary I am truely the frits.

  • (cs) in reply to RRDY
    RRDY:
    Y_F:
    TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

    That's the first thing that caught my eye. I wonder how many years it's been since that URL actually worked?

    It appears to be on http://www.webmonkey.com/2010/02/special_characters/ now. Still doesn't list the euro.

  • This might be Garrison Fiord (unregistered) in reply to Obviously Fake Garrison Fiord

    [quote user="Obviously Fake Garrison Fiord"][quote]strangers can just use [my name] and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself[/quote]Indeed you do! (But remember the famous advice: "Never attribute to malice that which can be adequately explained by [your own] stupidity.")[/quote]

    Mission Accomplished[/quote]How do we know that "Obviously Fake Garrison Fiord" is not actually the very real "Garrison Fiord"?

  • Pinky Fjord (unregistered) in reply to This might be Garrison Fiord
    This might be Garrison Fiord:
    How do we know that "Obviously Fake Garrison Fiord" is not actually the very real "Garrison Fiord"?
    Especially given the consistency in bad quoting....
  • Cookyt (unregistered) in reply to chubertdev

    Wouldn't any such filter simply return the empty string?

  • (cs) in reply to Hypnos
    Hypnos:
    Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

    Outrageously bad coding style. In order to alert the user to the fact that he/she is about to read a return statement, it ought to be:

        /*
         * This is where the function / subroutine / method / delete as applicable
         * is expected to terminate. This is because it has completed
         * the task assigned for it, and can pass control back to the 
         * function / subroutine / method / program (delete as applicable)
         * which called it. Please do not omit the task of completing the
         * documentation. Thank you for your attention to detail.
         */
    

    Nothing less is even remotely acceptable.

  • Helluin (unregistered) in reply to chubertdev

    Just ran a search for "How to Extract Text from HTML". It's already 3rd on Google.

  • Captcha:cogo (unregistered) in reply to Ralph
    Ralph:
    What we need is a function that turns this... <html><script ... bloatware ... adware ... branded header ... sidebars ... fancy "navigation" crap ... disclaimers ...>

    Tiny bit of useful content.

    <... more crap ...>

    into...

    Tiny bit of useful content.

    Something like cleanPages?

    Since I assume you don't use Opera, there is also this bookmarklet.

    Ghostery is also fine if you just want to filter Facebook buttons and other crap.

  • Peter Lawrey (unregistered)

    I like the loop

            for (int index = 0; index < result.Length; index++)
            {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }
    

    If the text left is 1 MB, it will loop a million times, trying to replace ever-longer sequences of breaks and tabs (when they should be getting shorter with each iteration).

    For the last few iterations, it will be trying to replace a string of breaks and a string of tabs that are longer than the original string!
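    If the goal was simply to collapse long runs of breaks and tabs, one pass per character class would do. A minimal sketch in C# (assuming, from the replacement strings, that the targets are two carriage returns and four tabs; CollapseRuns is a hypothetical name, not the article's code):

        using System.Text.RegularExpressions;

        // Collapse any run of 3+ carriage returns to exactly 2, and any run
        // of 5+ tabs to exactly 4 -- one regex pass each, no million-iteration loop.
        static string CollapseRuns(string result)
        {
            result = Regex.Replace(result, "\r{3,}", "\r\r");
            result = Regex.Replace(result, "\t{5,}", "\t\t\t\t");
            return result;
        }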

  • cousteau (unregistered)

    I heard Chuck Norris is able to parse HTML using regular expressions.

  • Douglas Henke (unregistered) in reply to chubertdev

    Oh, that's easy. A module that produces empty output passes.

    Proof: For any yahoo, geocities or myspace page, the set "important content" is empty.

    QED

  • (cs) in reply to chubertdev

    Ummm, could you repeat "...the following criteria"? I saw process steps (how to achieve a criterion), but no criteria! ;-) Have a great weekend all!

  • hmm (unregistered) in reply to QJo
    QJo:
    Hypnos:
    Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

    Outrageously bad coding style. In order to alert the user to the fact that he/she is about to read a return statement, it ought to be:

        /*
         * This is where the function / subroutine / method / delete as applicable
         * is expected to terminate. This is because it has completed
         * the task assigned for it, and can pass control back to the 
         * function / subroutine / method / program (delete as applicable)
         * which called it. Please do not omit the task of completing the
         * documentation. Thank you for your attention to detail.
         */
    

    Nothing less is even remotely acceptable.

    What about // this is where i gave up

  • D. (unregistered)

    Near the top:

        // Replace line breaks with space
        // because browsers inserts space
        result = result.Replace("\n", " ");

    In the middle:

        // make line breaking consistent
        result = result.Replace("\n", "\r");

    Just in case the first replace didn't replace them all :)

  • Andrew Beard (unregistered) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

    I can't imagine any of the three sites you listed have any important content. It seems like echo "\n" fulfills all those requirements.

  • Graham (unregistered) in reply to szeryf

    Back when I worked in 6809 asm (old videogame code) I came across a lengthy code block, pages and pages long, with just one comment at the end:

    die ; we are outta here!!

    "die" was a macro that expanded to "jsr sys_proc_exit" in out heavily multi-threaded (and very efficient) system.

  • Lars-Erik (unregistered)

    The using statement is for pussies! wtg!

  • Len (unregistered)

    Regex sucks for HTML, but .NET can compile regexes into DLLs and make some operations FLY!
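    For what it's worth, the in-process flavor of that is RegexOptions.Compiled, which compiles the pattern to MSIL when the Regex is constructed; Regex.CompileToAssembly is the .NET Framework API that writes the compiled pattern out as a DLL. A minimal sketch of the former (the tag-stripping pattern is just an illustration, not the article's code):

        using System.Text.RegularExpressions;

        // Compiled once, reused many times -- the usual first step before
        // reaching for Regex.CompileToAssembly and a pre-built assembly.
        static readonly Regex TagStripper = new Regex("<[^>]*>", RegexOptions.Compiled);

        static string StripTags(string html) => TagStripper.Replace(html, " ");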

  • tester (unregistered)

    I really don't see anything wrong with that code.

    It's the classic 'remove everything that you don't want and what is left is what you want' approach.

  • Gobzo (unregistered) in reply to chubertdev
    chubertdev:
    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages: ... 3) five random MySpace pages

    To meet the following criteria: ... 2) preserve the important content ...

    return null.

  • Joe (unregistered) in reply to Geoff

    This is correct, Geoff. (I posted the code btw)

    To "fix" it, I initially integrated HtmlAgility I believe or some similar lightweight library for HTML stripping like many suggested on this thread (which of course made it much faster). But then, as you mention, I realized that the appearance of the "extracted text" was not the same, which turned out to be pretty important since it was viewed by a viewer in the application as the "extracted text" version of an HTML formatted email.

    Either way, I sped that method up dramatically (that loop at the bottom is ridiculous), but I wanted to point out that (like most things) it wasn't as black/white as it seemed at first...
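    For anyone who wants the library route Joe describes, a minimal HtmlAgilityPack sketch (my own illustration, not Joe's code; note that InnerText makes no attempt to mimic browser whitespace, which is exactly the appearance problem he ran into):

        using HtmlAgilityPack;

        static string ExtractText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Remove script and style elements so their bodies don't leak
            // into the extracted text. SelectNodes returns null on no match.
            var junk = doc.DocumentNode.SelectNodes("//script|//style");
            if (junk != null)
                foreach (var node in junk)
                    node.Remove();

            // Decode &amp;, &euro;, etc. instead of regex-stripping entities.
            return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        }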

  • Lucas H (unregistered)

    When all you have is a hammer, now you have two problems.

  • Grayson C (unregistered) in reply to chubertdev

    If you're processing those pages with those criteria, the easiest way to meet the spec would be to return the empty string.

  • Ed (unregistered) in reply to Coward

    This is, in fact, the first page returned by a Google search for "extract html text". I was all set to copy and paste the function before I read it a bit more closely.

  • Andy Haritovich (unregistered)

    Or just use one of the excellent API services for this. Diffbot makes this dead simple:

    http://diffbot.com/api/article?token=...&url=...

    and you get back a JSON object with your cleaned text, title, images, dates and whatever else you could want.
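    A minimal sketch of calling that endpoint from C# (YOUR_TOKEN and the article URL are placeholders; Diffbot issues real tokens per account):

        using System;
        using System.Net.Http;
        using System.Threading.Tasks;

        static async Task Main()
        {
            var endpoint = "http://diffbot.com/api/article"
                         + "?token=YOUR_TOKEN"
                         + "&url=" + Uri.EscapeDataString("http://example.com/some-article");

            using var client = new HttpClient();

            // The response is a JSON object carrying the cleaned text,
            // title, images, dates, and so on.
            Console.WriteLine(await client.GetStringAsync(endpoint));
        }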
