Scrape a web page that requires they give you a session cookie first


Solution 1

Using requests, this is a trivial task:

>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)

>>> print r.cookies
{'requests-is': 'awesome'}
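
If the cookie has to carry over to a later request (as in the question below), a requests Session keeps the cookie jar for you. A minimal sketch for the asker's scenario, assuming the servlet's landing page is what hands out the session cookie; the landing-page URL and output filename are illustrative:

import requests

landing_url = 'http://nrega.ap.gov.in/Nregs/FrontServlet'   # assumed landing page; use whatever page issues the cookie
file_url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH'
            '&hhid=192420317026010002&actionVal=musterrolls&type=Normal')

session = requests.Session()        # cookies set by any response are stored here
session.get(landing_url)            # first request: the server sets the session cookie
response = session.get(file_url)    # second request: the cookie is sent back automatically

with open('muster_roll.xls', 'wb') as f:
    f.write(response.content)

Every request made through the same Session object reuses the accumulated cookies, so there is no need to pass a cookie jar around by hand.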

Solution 2

Using cookielib and urllib2:

import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use opener to open different urls

You can use the same opener for several connections:

data = [opener.open(url).read() for url in urls]

Or install it globally:

urllib2.install_opener(opener)

In the latter case, the rest of the code looks the same with or without cookie support:

data = [urllib2.urlopen(url).read() for url in urls]
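
Applied to the asker's scenario (see the comments below), the whole flow might look like the following sketch. It assumes, as the asker observed, that a second request made with the same cookie jar is what returns the spreadsheet; the output filename is illustrative.

import cookielib
import urllib2

url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH'
       '&hhid=192420317026010002&actionVal=musterrolls&type=Normal')

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

urllib2.urlopen(url)                # first hit: the server attaches a session cookie to cj
response = urllib2.urlopen(url)     # second hit: the cookie is replayed, the file comes back

with open('muster_roll.xls', 'wb') as f:
    f.write(response.read())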

Comments

  • rd108, about 4 years ago

    I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I use to access this Excel file:

    http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal

    requires that I have a session cookie from the government site attached to the request.

    How could I grab the session cookie with an initial request to the landing page (where they give you the session cookie) and then use it to hit the URL above to grab the Excel file? I'm on Google App Engine, using Python.

    I tried this:

    import urllib2
    import cookielib
    
    url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'
    
    
    def grab_data_with_cookie(cookie_jar, url):
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
        data = opener.open(url)
        return data
    
    cj = cookielib.CookieJar()
    
    # grab the data
    data1 = grab_data_with_cookie(cj, url)
    # the second time we do this, we get back the excel sheet
    data2 = grab_data_with_cookie(cj, url)
    
    stuff2 = data2.read()
    

    I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?