How to read an entire web page into a variable


Solution 1

You are probably looking for Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/), an open-source HTML parsing library for Python. Best of luck!

Solution 2

The object returned by urlopen() is file-like, so you should be able to call read() on it to get the entire response body as a single string. Something like

import urllib2
data = urllib2.urlopen(url)
print data.read()

should give you the entire webpage.
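
For what it's worth, here is a minimal sketch of the same idea for Python 3, where urllib2.urlopen is available as urllib.request.urlopen. The helper name fetch_page and the UTF-8 fallback are my own assumptions, not something from the question:

```python
import urllib.request


def fetch_page(url):
    """Return the whole body at `url` as text (fetch_page is a hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # read() with no argument returns the complete body as bytes.
        raw = response.read()
        # Use the charset declared in the headers; falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling page = fetch_page(url) then leaves the whole page source in page as one string.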

From there, don't parse the HTML with regular expressions (there is a well-worn post to this effect); use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
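
As a rough illustration of the dedicated-parser route, here is a sketch using html.parser from the Python 3 standard library to pull out link targets. The names LinkCollector and extract_links are hypothetical, and collecting href values is just an assumed example task; a library such as Beautiful Soup would do the same job with less code:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

The parser copes with mixed-case tags and attribute quoting, which is where hand-written patterns tend to break.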

Solution 3

Actually, print data should not give you any HTML content, because data is just a file-like object, not the page text. From the official documentation (https://docs.python.org/2/library/urllib2.html):

This function returns a file-like object

This is what I got:

print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>

readlines() returns a list of the lines of the HTML source, and you can store them in a string like this:

import urllib2
data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)
# Each line already ends with its own newline, so join with an empty separator.
s = ''.join(l)

You can either use list l or string s, according to your need.
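
One detail worth checking when joining: readlines() leaves the line ending on each item. A small sketch in Python 3, using io.BytesIO as an assumed stand-in for the object urlopen() returns, so it runs without a network connection:

```python
import io

# Stand-in for a urlopen() response; any binary file-like object behaves the same here.
response = io.BytesIO(b"<html>\n<body>hi</body>\n</html>\n")

lines = response.readlines()  # every item keeps its trailing b"\n"
page = b"".join(lines)        # an empty separator rebuilds the body unchanged
```

Joining with a newline separator instead would put a blank line between every pair of source lines.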

Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.

    I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?

     data = urllib2.urlopen(url)
     print data
    

    Only gives me about 1/3 of the source.

     data = urllib2.urlopen(url)
     for lines in data.readlines():
          print lines
    

    This gives me the entire source.

    Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.