BeautifulSoup get_text does not strip all tags and JavaScript


Solution 1

nltk's clean_html() is quite good at this!

Assuming that you already have your HTML stored in a variable html, like

html = urllib.urlopen(address).read()

then just use

import nltk
clean_text = nltk.clean_html(html)

UPDATE

Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.

An example of how to achieve this is on this page:

BeatifulSoup4 get_text still has javascript
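
For completeness, here is a minimal sketch of what such a BeautifulSoup-based replacement might look like (this is my own stand-in for the removed clean_html, assuming a reasonably recent bs4; the function name is only illustrative):

from bs4 import BeautifulSoup

def clean_html(html):
    # Parse the markup and drop elements whose contents are never rendered.
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'head', 'title']):
        tag.decompose()
    # Collapse the remaining text into a single whitespace-normalised string.
    return ' '.join(soup.get_text(separator=' ').split())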

Solution 2

Here's an approach based on jbochi's answer to BeautifulSoup Grab Visible Webpage Text. It handles comments embedded in elements that contain page text, and does a bit to clean up the output by stripping newlines, consolidating whitespace, and so on.

import re
import urllib

import BeautifulSoup

html = urllib.urlopen(address).read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible_text(element):
    # Text inside these containers is never rendered on the page.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return ''
    # Strip comments and newlines, then collapse runs of whitespace.
    result = re.sub(r'<!--.*-->|\r|\n', '', str(element), flags=re.DOTALL)
    result = re.sub(r'\s{2,}|&nbsp;', ' ', result)
    return result

visible_elements = [visible_text(elem) for elem in texts]
page_text = ''.join(visible_elements)
print(page_text)
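
A variant of the same idea, assuming a recent bs4 rather than the old BeautifulSoup 3 module, is to filter out bs4's Comment nodes directly instead of stripping them with a regex. This is just a sketch of that alternative, not the code from the linked answer:

import re

from bs4 import BeautifulSoup, Comment

def visible_page_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    visible = []
    for element in soup.find_all(string=True):
        # Skip comments and text that sits inside non-rendered containers.
        if isinstance(element, Comment):
            continue
        if element.parent.name in ['style', 'script', 'head', 'title', '[document]']:
            continue
        visible.append(element)
    # Join the fragments and consolidate whitespace.
    return re.sub(r'\s+', ' ', ' '.join(visible)).strip()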

Solution 3

This was the problem I was having. No solution seemed to be able to return just the text (the text that would actually be rendered in a web browser). Other solutions mentioned that BS is not ideal for rendering and that html2text was a good approach. I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed delta may depend heavily on the contents of the data...

One answer here from @Helge was about using nltk of all things.

import nltk

%timeit nltk.clean_html(html)
This returned about 153 µs per loop.

It worked really well at returning a string with the rendered text of the HTML. This nltk module was faster than even html2text, though perhaps html2text is more robust.

import html2text

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)

This took about 3.09 ms per loop.
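
The %timeit figures above come from IPython. As a rough sketch, the same kind of comparison can be run as a plain script with the standard timeit module; here nltk.clean_html is swapped out for a simple BeautifulSoup extraction, since clean_html has since been removed, and html is assumed to already hold the page source as a str (the numbers will of course vary with the input):

import timeit

import html2text
from bs4 import BeautifulSoup

def bs_text(markup):
    # Drop script/style content, then return the remaining text.
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text()

runs = 100
for name, extract in [('BeautifulSoup', bs_text), ('html2text', html2text.html2text)]:
    seconds = timeit.timeit(lambda: extract(html), number=runs)
    print('%s: %.2f ms per call' % (name, 1000.0 * seconds / runs))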

Updated on June 16, 2022

Comments

  • piokuc, almost 2 years ago

    I am trying to use BeautifulSoup to get text from web pages.

    Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second the output file.

    import sys
    from bs4 import BeautifulSoup
    
    def stripTags(s): return BeautifulSoup(s).get_text()
    
    def stripTagsFromFile(inFile, outFile):
        open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))
    
    def main(argv):
        if len(sys.argv) <> 3:
            print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
            return 1
        stripTagsFromFile(sys.argv[1], sys.argv[2])
        return 0
    
    if __name__ == "__main__":
        sys.exit(main(sys.argv))
    

    Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):

    html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
        Education Manager  Job In London With  Caleeda | Great Jobs In Teaching
    
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-15255540-21']);
    _gaq.push(['_trackPageview']);
    _gaq.push(['_trackPageLoadTime']);
    

    Is there anything wrong with my script? I was trying to pass 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help. Is there an alternative to BeautifulSoup which would work better for this task? All I want is to extract the text which would be rendered in a browser for this web page.

    Any help will be much appreciated.