• Sheesh (unregistered)

    It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!

    s/<.*>//g

  • Hans (unregistered)

    Are these the solutions you find on the Internet?

  • Smug Unix User (unregistered)

    Great now someone will copy and paste that into their app. If you are a Python user Scrapy is pretty awesome. http://scrapy.org/

  • llandor (unregistered)

    Now, I may not be a professional developer, but wouldn't for instance php's strip_tags function do this?

    Then again, I'm not getting paid per line of code, so what do I know?

  • szeryf (unregistered)

    Starting today, all my return statements will have the comment // That's it.

  • Ralph (unregistered)

    What we need is a function that turns this...

    <html><script ... bloatware ... adware ... branded header ... sidebars ... fancy "navigation" crap ... disclaimers ...>

    Tiny bit of useful content.

    <... more crap ...>

    into...

    Tiny bit of useful content.

  • Gyxi (unregistered) in reply to szeryf
    szeryf:
    Starting today, all my return statements will have the comment // That's it.

    Ha ha ha, brilliant!

    • dignissim - an attempt at simulating dignity
  • (cs)

    Just spotted this, but seriously... is this the best way to do this?

    public static string StripTagsCharArray(string source) {
        try
        {
            string result;
    ...
  • Garrison Fiord (unregistered)
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

  • My name is unimportant (unregistered)

    I wrote one of these and it worked pretty well, and I didn't use RegEx. But I did it as a series of classes. I am not sure why someone decides to do it as a single function unless they fail to realize the scope. There are issues of whitespace, tabs, number of lines between p or br tags, bulleted and numbered lists, horizontal lines, and all of the funky escaped and encoded characters. OK, not really sure what my point is.
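
A minimal sketch of the non-regex approach described above, written in Python with the standard library's `html.parser`. The class name, the helper `html_to_text`, and the choice of which tags count as line breaks are assumptions made for illustration; this is not the commenter's actual code, and it covers only whitespace, entities, and a few block tags.

```python
from html.parser import HTMLParser

# Assumed for this sketch: tags that should start a new line, and tags to skip.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "hr", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text, turning block-level tags into line breaks."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    # Collapse runs of spaces and tabs, then drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "<html><head><style>p{color:red}</style><script>alert(1 > 0);</script></head>"
    "<body><p>Fish &amp;   chips</p><ul><li>one</li><li>two<br>three</li></ul></body></html>"
)
print(html_to_text(sample))
```

Numbered lists, tables, and horizontal rules would each need their own handling, which is probably why a set of classes scales better than a single function.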

  • Nagesh (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Stop to criticize gammer error. Ain't being worth any being's time.

  • Philip Newton (unregistered)

    Captcha: dolor - because my head hurts.

  • (cs) in reply to Garrison Fiord
    Garrison Fiord:
    or are you just going to buntcher my posts like that drunken narcissist Remy?
    *butcher
  • qazwsx (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

  • Garrison Fiord (unregistered) in reply to qazwsx
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.

  • Zemm (cs) in reply to Sheesh
    Sheesh:
    It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!

    s/<.*>//g

    Ah, that would clobber the entire document, since * is greedy!

    s/<.*?>//g or s/<[^>]*>//g

    Better. Assuming valid (ish) markup. And then just replace HTML entities with their characters.

    It would show the inline Javascript and style sheets. It would fail for CDATA sections containing > characters. Embedded newlines will be preserved whereas <p> and <br> and table codes will be ignored. So you may want to do something about those too...

    It's not a trivial process to turn HTML into POT!
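
The difference between the greedy and non-greedy patterns can be checked with a short Python sketch. The sample string and the helper name `strip_with` are made up for illustration, and `re.DOTALL` is an assumption so that `.` also spans newlines.

```python
import html
import re

GREEDY = re.compile(r"<.*>", re.DOTALL)    # the original one-liner
LAZY = re.compile(r"<.*?>", re.DOTALL)     # non-greedy variant
NEGATED = re.compile(r"<[^>]*>")           # negated character class variant


def strip_with(pattern, source):
    """Remove every match of pattern, then decode HTML entities."""
    return html.unescape(pattern.sub("", source))


sample = "<p>Fish &amp; <b>chips</b></p><script>if (a > b) go();</script>"

print(repr(strip_with(GREEDY, sample)))   # everything from the first < to the last > is gone
print(repr(strip_with(LAZY, sample)))     # text survives, but so does the inline script
print(repr(strip_with(NEGATED, sample)))
```

Entities are decoded after the tags are removed, so that an escaped `&lt;b&gt;` in the text is not mistaken for a tag.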

  • Coward (unregistered) in reply to Zemm

    Given the popularity of this site, I can't help but wonder how soon this will reach the top of search results and some ignorant idiot thinks that this is a good way to do it...

  • yetihehe (unregistered) in reply to Zemm
    Zemm:
    Sheesh:
    It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!

    s/<.*>//g

    Ah, that would clobber the entire document, since * is greedy!

    I assume that was the point (pun intended).
  • Amazing Expression (unregistered) in reply to Zemm
    Zemm:
    It's not a trivial process to turn HTML into POT!
    (Spoken in the sarcastic voice of one of the most clued-in managers I ever had the pleasure of working for...) I don't understand it, so it must be easy!
  • trtrwtf (unregistered) in reply to Zemm
    Zemm:
    Sheesh:
    It is so much easier than that! Just remove everything between angle brackets in one pass. What's left is the text!

    s/<.*>//g

    Ah, that would clobber the entire document, since * is greedy!

    That seems the most sensible thing to do, all things considered. Have you seen that internet lately?

  • Y_F (cs)

    TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

  • limos (unregistered) in reply to llandor
    llandor:
    Now, I may not be a professional developer, but wouldn't for instance php's strip_tags function do this?
    The ideal solution must surely be PHP's wonderful fgetss() function, which is "Identical to fgets(), except that fgetss() attempts to strip any NUL bytes, HTML and PHP tags from the text it reads."

    Perfect! I always knew it would come in handy one day

  • Honnza (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.

    "You're not" or "You aren't". Do you want to start a grammar fight by a sentence that has a syntax error?

  • Geoff (unregistered) in reply to Sheesh

    Not that what the author of the article posted is the right way to go about it but, you can clearly infer from what they wrote there was intent to preserve some of the html document's formatting as plain text. It replaces <br> with \n for example. Your suggestion would lose all of the formatting.
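
As a rough illustration of preserving line breaks before stripping, here is a Python sketch. The patterns and the function name are assumptions for this example, not the article's code, and the approach still inherits every weakness of regex-based stripping.

```python
import re

# Assumed for this sketch: <br> in any spelling and closing </p> become newlines.
LINE_BREAK = re.compile(r"<br\s*/?>|</p\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def strip_keeping_breaks(source):
    """Turn line-breaking tags into newlines, then drop the remaining tags."""
    with_breaks = LINE_BREAK.sub("\n", source)
    return ANY_TAG.sub("", with_breaks).strip()


sample = "<p>one<br>two<BR />three</p><p>four</p>"
print(strip_keeping_breaks(sample))
print(ANY_TAG.sub("", sample))  # plain stripping runs the words together
```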

  • D-Coder (cs) in reply to Garrison Fiord
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.
    Okay, now we know you're doing it on purpose.

  • Linguo (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-.

    Error. Sentence fragments.

  • foo (unregistered) in reply to Honnza
    Honnza:
    "You're not" or "You aren't". Do you want to start a grammar fight by a sentence that has a syntax error?
    Do you want to start fighting a troll by getting trolled?
  • Hypnos (unregistered) in reply to szeryf

    Yeah, it's good. But I'm going for "// Th-th-th-that's all folks!"

  • (cs)
    Your average developer, when faced with this situation, would do what any sane person would do – first try to tackle the problem on their own and after a few frustrating iterations, eventually turn to the Internet to solve their problem. Thankfully, as it turns out, the problem of parsing text out of web code has been solved several times over effectively turning your development task into an integration task. Hooray!

    whereas your expert developer will probably google first before attempting to tackle the issue themselves.

    If I did have to write my own implementation for HTML pages, if they are HTML1.1 compliant then they are valid XML and could be parsed as such, but HTML1.0 doesn't enforce that and as such will often contain breaks like <br> as unclosed tags so won't work with an XML parser.
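
The unclosed-tag problem is easy to reproduce with Python's standard `xml.etree.ElementTree`. This is only a sketch; the function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def xml_text(source):
    """Return the concatenated text of a well-formed XML document."""
    return "".join(ET.fromstring(source).itertext())


def try_xml_text(source):
    """Like xml_text, but return None when the markup is not well-formed XML."""
    try:
        return xml_text(source)
    except ET.ParseError:
        return None


print(try_xml_text("<p>one<br/>two</p>"))  # self-closed break: parses
print(try_xml_text("<p>one<br>two</p>"))   # unclosed break: rejected by the XML parser
```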

    Regex is not good for parsing recursive (nested) expressions.


    Filed under reinventing your own square wheel anti-pattern

  • Emil Vikström (unregistered) in reply to Coward

    It will never beat the related Stack Overflow question on the subject. The accepted answer is hilarious. Read and enjoy: RegEx match open tags except XHTML self-contained tags

    The fun thing is that sometimes regexes aren't that bad. I think the Readability project actually used regexes coupled with some DOM parsing to actually solve exactly the problem this developer had - extracting the real article text out of websites.

  • (cs)

    Using regular expressions to parse HTML? I vaguely remember having read about this before... But only very vaguely, as from a distant nightmare!

    Addendum (2012-09-12 15:37): Damn, beaten by Emil Vikström because stupid Akismet forced me to add a sentence to my post before accepting it is not spam!

    Akismet is the real WTF.

  • Garrison Fiord (unregistered) in reply to D-Coder
    D-Coder:
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.
    Okay, now we know you're doing it on purpose.
    Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

  • Tom (unregistered) in reply to Emil Vikström
    Emil Vikström:
    It will never beat the related Stack Overflow question on the subject. The accepted answer is hilarious. Read and enjoy: RegEx match open tags except XHTML self-contained tags
    Truth, brother, truth! Yea verily the apocalypse is upon us, even at the door.
  • Jamie Zawinski was right (unregistered)

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

    --jwz

  • Tom Mathias (unregistered)

    Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

  • Garrison Fiord (unregistered) in reply to Garrison Fiord
    Garrison Fiord:
    D-Coder:
    Garrison Fiord:
    qazwsx:
    Garrison Fiord:
    Mark Bowytz:
    With the ads, silly social media add-ons, sidebars, toolbars, and potentially WTF-level web page coding practices, unless you’re looking at a set of pure vanilla, consistently designed pages, it can be a big mess.

    Run-on sentence. C-. Are you man enough to admit your mistakes, or are you just going to buntcher my posts like that drunken narcissist Remy?

    Not every long sentence is a run-on.

    That one certainly is, though. Great job, Mark, on showing your not the bigger man.
    Okay, now we know you're doing it on purpose.
    Don't you idiots not understand that Mark is intentionally Bowitzing my posts? I speak good grammer all the time.

    I also haven't registered my name so I can't edit my posts and other strangers can just use it and make me sound like an idiot though I don't need they're halp cause I do that pretty good myself

  • redtetrahedron (cs)

    Kinda looks like C#... How about a WebBrowser object (snippet):

    string result = string.Empty;
    webBrowser1.DocumentText = source;
    webBrowser1.DocumentCompleted += (sender, e) =>
        {
            HtmlDocument doc = webBrowser1.Document;
            result = doc.GetElementsByTagName("html")[0].InnerText;
        };
    

    Addendum (2012-09-12 10:41): (Yea, there are probably threading issues with this... just trying to illustrate the technique...)

  • RRDY (unregistered) in reply to Y_F
    Y_F:
    TRWTF is http://hotwired.lycos.com/webmonkey/reference/special_characters/

    That's the first thing that caught my eye. I wonder how many years it's been since that URL actually worked?

    Captcha: ACSI: The earliest SCSI prototype: about 18 versions prior to the final release.

  • William (unregistered)

    Obligatory regex + HTML = insanity: http://stackoverflow.com/a/1732454/213197

  • (cs) in reply to Tom Mathias
    Tom Mathias:
    Obligatory StackOverflow link to "that answer" about HTML and regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    Jeff Atwood's thoughts on it: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

    Dang, beat me to it.

    Do I still get a silver medal for the "you have a problem, you use regular expressions, and now you have two problems" quote?

    Addendum (2012-09-12 11:19): EDIT:

    Dang again, beaten to that one as well.

  • (cs)

    The TRWTF here is the reality of trying to parse a very forgiving, interpreted language using a strict set of rules.

    http://www.codinghorror.com/blog/2007/04/javascript-and-html-forgiveness-by-default.html

  • iWantToKeepAnon (unregistered)

    My "solved several times over effectively" stack is saxon + tagsoup. Works like a charm. Tagsoup parses real world crap html and saxon xslt lets me slice and dice and make julienne fries.

  • Herr Otto Flick (unregistered) in reply to iWantToKeepAnon
    iWantToKeepAnon:
    My "solved several times over effectively" stack is saxon + tagsoup. Works like a charm. Tagsoup parses real world crap html and saxon xslt lets me slice and dice and make julienne fries.

    saxon is Java though, so that would make me sad. Slightly lighter is python, lxml and html5lib.

  • (cs) in reply to qazwsx
    qazwsx:
    Not every long sentence is a run-on.
    Don't feed the very obvious troll.
  • Neil (unregistered) in reply to Emil Vikström
    Emil Vikström:
    It will never beat the related Stack Overflow question on the subject. Read and enjoy: RegEx match open tags except XHTML self-contained tags
    That cad, Sam, got over 10% of his reputation from his answer to that question...
  • (cs) in reply to Jamie Zawinski was right
    Jamie Zawinski was right:
    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have twenty-eight problems.

    FTFY

  • wyz (unregistered) in reply to redtetrahedron

    redtetrahedron:
    string result = string.Empty;
    webBrowser1.DocumentText = source;
    webBrowser1.DocumentCompleted += (sender, e) =>
        {
            HtmlDocument doc = webBrowser1.Document;
            // That's it.
            result = doc.GetElementsByTagName("html")[0].InnerText;
        };

    FTFY, now it's consistent with today's topic.

  • Daniel Smedegaard Buus (unregistered)

    Is this what's called a regdep? Or, Regular Depression?

    At least this is comforting. It lets me know that I can drink and dope myself into oblivion. All I have to do to get a job in IT is to semi-sober up for a couple of hours to go to an interview.

  • myName (unregistered)

    I thought the WTF was the line about processing (only) 700k files. Does this refer to how many were done or that it could only handle files of that size?

    And we don't know the state of the original HTML, when the code was written, or whether this is code which has been changed because of changes to the HTML.

  • (cs)

    To be certified as a good content filter, any module should have to pass this set of tests:

    Strip these pages:

    1. yahoo.com
    2. five random Geocities sites from 2000
    3. five random MySpace pages

    To meet the following criteria:

    1. remove all ads and extraneous material
    2. preserve the important content
    3. the only formatting should be semantic formatting on the important content

    I call it the Flying Spaghetti Code test.

Leave a comment on “How to Extract Text from HTML (Experts Only)”
