Scrape a web page that requires they give you a session cookie first


Solution 1

Using requests, this is a trivial task:

>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)

>>> print r.cookies
{'requests-is': 'awesome'}
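
If the cookie has to carry over to a later request (as in the question below), a requests Session keeps the cookie jar for you. A minimal sketch for the asker's scenario, assuming the servlet's landing page is what hands out the session cookie; the landing-page URL and output filename are illustrative:

import requests

landing_url = 'http://nrega.ap.gov.in/Nregs/FrontServlet'   # assumed landing page; use whatever page issues the cookie
file_url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH'
            '&hhid=192420317026010002&actionVal=musterrolls&type=Normal')

session = requests.Session()        # cookies set by any response are stored here
session.get(landing_url)            # first request: the server sets the session cookie
response = session.get(file_url)    # second request: the cookie is sent back automatically

with open('muster_roll.xls', 'wb') as f:
    f.write(response.content)

Every request made through the same Session object reuses the accumulated cookies, so there is no need to pass a cookie jar around by hand.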

Solution 2

Using cookielib and urllib2:

import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use opener to open different urls

You can use the same opener for several connections:

data = [opener.open(url).read() for url in urls]

Or install it globally:

urllib2.install_opener(opener)

In the latter case, the rest of the code looks the same with or without cookie support:

data = [urllib2.urlopen(url).read() for url in urls]
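
Applied to the asker's scenario (see the comments below), the whole flow might look like the following sketch. It assumes, as the asker observed, that a second request made with the same cookie jar is what returns the spreadsheet; the output filename is illustrative.

import cookielib
import urllib2

url = ('http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH'
       '&hhid=192420317026010002&actionVal=musterrolls&type=Normal')

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

urllib2.urlopen(url)                # first hit: the server attaches a session cookie to cj
response = urllib2.urlopen(url)     # second hit: the cookie is replayed, the file comes back

with open('muster_roll.xls', 'wb') as f:
    f.write(response.read())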

Comments

  • rd108, about 4 years ago

    I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I use to access this Excel file:

    http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal

    requires that I have a session cookie from the government site attached to the request.

    How could I grab the session cookie with an initial request to the landing page (where they give you the session cookie) and then use it to hit the URL above to grab the Excel file? I'm on Google App Engine, using Python.

    I tried this:

    import urllib2
    import cookielib
    
    url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'
    
    
    def grab_data_with_cookie(cookie_jar, url):
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
        data = opener.open(url)
        return data
    
    cj = cookielib.CookieJar()
    
    # grab the data
    data1 = grab_data_with_cookie(cj, url)
    # the second time we do this, we get back the excel sheet
    data2 = grab_data_with_cookie(cj, url)
    
    stuff2 = data2.read()
    

    I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?