BeautifulSoup get_text does not strip all tags and JavaScript
Solution 1

nltk's clean_html() is quite good at this!
Assuming you already have your HTML stored in a variable html, e.g.:

html = urllib.urlopen(address).read()

then just use:
import nltk
clean_text = nltk.clean_html(html)
UPDATE

Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.
An example of how to achieve this is on this page:
BeautifulSoup4 get_text still has javascript
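For reference, the approach the linked page takes boils down to removing the script and style elements before calling get_text(). A minimal sketch with BeautifulSoup4 (the sample HTML and the html.parser backend are my choices, not taken from the linked answer):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<style>p { color: red; }</style>
<script>var x = 1;</script>
<p>Visible text.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Drop the elements whose contents a browser never renders as text.
for tag in soup(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)
```

With the script and style subtrees decomposed, get_text() no longer leaks JavaScript or CSS into the output.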
Solution 2
Here's an approach which is based on the answer here: BeautifulSoup Grab Visible Webpage Text by jbochi. This approach allows for comments embedded in elements containing page text, and does a bit to clean up the output by stripping newlines, consolidating space, etc.
import re
import urllib

from bs4 import BeautifulSoup

html = urllib.urlopen(address).read()
soup = BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible_text(element):
    # Drop text that the browser would not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return ''
    # Strip comments and newlines, then consolidate whitespace.
    result = re.sub(r'<!--.*-->|\r|\n', '', str(element), flags=re.DOTALL)
    result = re.sub(r'\s{2,}|&nbsp;', ' ', result)
    return result

visible = ''.join(visible_text(elem) for elem in texts)
print(visible)
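The two re.sub calls do the heavy lifting for embedded comments and whitespace. In isolation (the sample fragment here is mine, for illustration) they behave like this:

```python
import re

fragment = "some   text <!-- a hidden\ncomment -->\r\n  more text"

# Strip HTML comments (DOTALL lets .* span the newline inside the
# comment) along with carriage returns and newlines.
result = re.sub(r'<!--.*-->|\r|\n', '', fragment, flags=re.DOTALL)

# Consolidate runs of two or more whitespace characters into one space.
result = re.sub(r'\s{2,}', ' ', result)

print(result)
```

Note that the greedy `.*` in the comment pattern would swallow everything between the first `<!--` and the last `-->` in a document with several comments; a non-greedy `.*?` is the safer choice there.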
Solution 3
This was the problem I was having. No solution seemed to be able to return the text (the text that would actually be rendered in the web browser). Other solutions mentioned that BS is not ideal for rendering and that html2text was a good approach. I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed delta may depend heavily on the contents of the data...
One answer here from @Helge was about using nltk of all things.
import nltk
%timeit nltk.clean_html(html)
returned 153 µs per loop.
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop
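The %timeit lines above are IPython magics; in a plain Python script the stdlib timeit module gives equivalent per-loop numbers. A sketch (the regex stripper here is a stand-in for the nltk/html2text calls, just to show the timing pattern):

```python
import re
import timeit

html = "<p>Hello <b>world</b></p>" * 100

def strip_tags(doc):
    # Crude tag stripper, used only as a timing target.
    return re.sub(r'<[^>]+>', '', doc)

# Run the function 1000 times, as %timeit would, and report
# the average cost of a single call.
seconds = timeit.timeit(lambda: strip_tags(html), number=1000)
print('%.1f us per loop' % (seconds / 1000 * 1e6))
```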
Updated on June 16, 2022

Comments

piokuc, almost 2 years ago
I am trying to use BeautifulSoup to get text from web pages.
Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second the output file.
import sys
from bs4 import BeautifulSoup

def stripTags(s):
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

def main(argv):
    if len(sys.argv) <> 3:
        print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
        return 1
    stripTagsFromFile(sys.argv[1], sys.argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
Unfortunately, for many web pages, for example: http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location I get something like this (showing only the first few lines):
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" Education Manager Job In London With Caleeda | Great Jobs In Teaching var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-15255540-21']); _gaq.push(['_trackPageview']); _gaq.push(['_trackPageLoadTime']);
Is there anything wrong with my script? I was trying to pass 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help. Is there an alternative to BeautifulSoup which would work better for this task? All I want is to extract the text which would be rendered in a browser for this web page.
Any help will be much appreciated.