BeautifulSoup can't parse a webpage?


Solution 1

From the docs:

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

Your code works as is (on Python 2.7 and Python 3.3) if you install a more robust parser on Python 2.7 (such as lxml or html5lib):

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen # py3k

from bs4 import BeautifulSoup # $ pip install beautifulsoup4

url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url)) # bs4 picks the best installed parser (lxml > html5lib > built-in)
print(soup.prettify())

The Python bug report "HTMLParser.py - more robust SCRIPT tag parsing" might be related.
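
As a side note (not part of the original answer): once lxml or html5lib is installed, you can also name the parser explicitly when constructing the soup, so the result doesn't depend on which parser Beautiful Soup happens to pick by default:

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen # py3k

from bs4 import BeautifulSoup # $ pip install beautifulsoup4 lxml

url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url), "lxml")  # or "html5lib", whichever you installed
print(soup.prettify())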

Solution 2

You cannot use BeautifulSoup, or any HTML parser, to reliably read web pages. You are never guaranteed that a web page is a well-formed document. Let me explain what is happening in this particular case.

On that page there is this INLINE JavaScript:

var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";

You can see that it's creating a string that will put a script tag onto the page. Now, if you're an HTML parser, this is a very tricky thing to deal with. You go along reading your tokens when suddenly you hit a <script> tag. Now, unfortunately, if you did this:

<script>
alert('hello');
<script>
alert('goodby');

Most parsers would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would think both are valid scripts.

So, in this case, BeautifulSoup sees a <script> tag, and even though it's inside a JavaScript string, it looks like it could be a valid start tag, and BeautifulSoup has a seizure, as well it should.

If you look at the string again, you can see they do this interesting piece of work:

... "</" + "script>";

This seems odd, right? Wouldn't it be better to just do str = " ... </script>" without the extra string concatenation? This is actually a common trick (by silly people who write script tags as strings, a bad practice) to make the parser NOT break. Because if you do this:

var a = '</script>';

in an inline script, the parser will really just see </script> and think the whole script tag has ended, and will dump the rest of the contents of that script tag onto the page as plain text. This is because you can technically put a closing script tag anywhere, even if your JS syntax is invalid. From a parser's point of view, it's better to get out of the script tag early rather than try to treat your HTML code as JavaScript.
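
To make this concrete, here is a minimal sketch (assuming the html5lib parser is installed; it follows the HTML5 tokenizer rules for script content) showing both behaviours: the early close caused by a literal </script>, and how the "</" + "script>" concatenation avoids it:

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4 html5lib

# 1. A literal </script> inside inline JavaScript ends the element early;
#    the rest of the "script" spills onto the page as plain text.
naive = "<script>var a = '</script>'; alert(a);</script>"
soup = BeautifulSoup(naive, "html5lib")
print(soup.script.string)  # only the text before the first </script>: "var a = '"

# 2. The "</" + "script>" concatenation keeps a literal </script> out of the
#    markup, so the whole statement stays inside a single script element.
tricky = """<script>
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js' type='text/javascript'></"+"script>";
</script>"""
soup = BeautifulSoup(tricky, "html5lib")
print(len(soup.find_all("script")))  # 1 -- the inner "<script" is just script text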

So, you can't use a regular HTML parser to parse web pages. It's a very, very dangerous game. There is no guarantee you'll get well-formed HTML. Depending on what you're trying to do, you could read the content of the page with a regex, or try getting the fully rendered page content with a headless browser.
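
For the headless-browser route, here is a rough sketch using Selenium (the original answer doesn't name a specific tool, so Selenium here is my assumption; it also requires Selenium 4+ and a local Chrome/chromedriver install):

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
html = driver.page_source  # the DOM after the page's scripts have run
driver.quit()

soup = BeautifulSoup(html, "lxml")  # the rendered HTML can be parsed as usual
print(soup.title)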

Solution 3

You need to use the html5lib parser with BeautifulSoup.

To install the required parser, use pip:

pip install html5lib

Then use that parser this way:

import mechanize
from bs4 import BeautifulSoup  # missing import added

br = mechanize.Browser()
html = br.open("http://google.com/", timeout=100).read()
soup = BeautifulSoup(html, 'html5lib')  # parse with html5lib
for a in soup.find_all('a'):
    print(a.get('href'))  # .get() avoids a KeyError on anchors without an href

Solution 4

One of the simplest things you can do is specify the parser as "lxml". You can do it by passing "lxml" to the BeautifulSoup() constructor as the second argument:

page = urllib2.urlopen("[url]")
soup = BeautifulSoup(page, "lxml")

Then your code will be as follows.

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())

So far I haven't had any trouble with this approach :)


Comments

  • JLTChiu over 1 year

    I am using Beautiful Soup to parse a webpage now. I've heard it's very popular and good, but it doesn't seem to work properly.

    Here's what I did

    import urllib2
    from bs4 import BeautifulSoup
    
    page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
    soup = BeautifulSoup(page)
    print soup.prettify()
    

    I think this is kind of straightforward. I open the webpage and pass it to BeautifulSoup. But here's what I got:

    Warning (from warnings module):

    File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

    "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

    ...

    HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

    I thought CNN's website would be well designed, so I am not very sure what's going on. Does anyone have an idea about this?

  • poke over 11 years
    “You cannot use […] any HTML parser to read web pages” – I think that’s a false statement. Web browsers do exactly that; they use a well-developed HTML parser to parse the content of webpages. Of course they add a lot more features on top of it, evaluating scripts and all that stuff, but they are still parsing the base HTML first. In this case, the built-in parser does not seem to be capable enough to accept this particular HTML (although it does work fine for me and Vor too), so a more capable parser would be required. It still stays an HTML parser though.
  • JLTChiu over 11 years
    I think I am using Python 2.7.2 (currently I cannot use that computer, so I am not 100% sure). So if I install a better parser like lxml, I don't have to modify my code at all? (I think the try and except part is for urllib, not related to BeautifulSoup.) Just wish to make sure that I understand it correctly. Thanks.
  • jfs over 11 years
    @JLTChiu: yes, you don't need to modify the code. The try/except is there so the same script can run on both Python 2 and Python 3 (urllib2 on Python 2, urllib.request on Python 3).
  • JLTChiu over 11 years
    Thank you so much, I really appreciate your help.