Scrape a web page that requires a session cookie to be issued first
Solution 1
With requests this is a trivial task:
>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print r.cookies
{'requests-is': 'awesome'}
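When the cookie has to survive into a second request, the usual tool is a Session, which stores cookies from each response and sends them back automatically on later requests. A minimal sketch against httpbin's documented cookie endpoints:

import requests

s = requests.Session()
# First request: the server sets a cookie, which the session stores.
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# Second request: the stored cookie is sent back automatically.
r = s.get('http://httpbin.org/cookies')
print r.text  # {"cookies": {"sessioncookie": "123456789"}}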
Solution 2
Using cookielib and urllib2:
import cookielib
import urllib2

cj = cookielib.CookieJar()  # collects cookies set by responses
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use this opener to open different urls; cookies persist between them
You can use the same opener for several connections:
data = [opener.open(url).read() for url in urls]
Or install it globally:
urllib2.install_opener(opener)
In the latter case, the rest of the code looks the same with or without cookie support:
data = [urllib2.urlopen(url).read() for url in urls]
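Either way, you can check that a session cookie actually arrived by inspecting the jar after the first request; a CookieJar is iterable over its Cookie objects:

# after the opener has fetched at least one url:
for cookie in cj:
    print cookie.name, cookie.value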
Original question (rd108)
I'm trying to scrape an excel file from a government "muster roll" database. However, the URL I have to use to access this excel file requires a session cookie from the government site attached to the request.
How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab our excel file? I'm on Google App Engine using Python.
I tried this:
import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'

def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()

# grab the data
data1 = grab_data_with_cookie(cj, url)
# the second time we do this, we get back the excel sheet.
data2 = grab_data_with_cookie(cj, url)
stuff2 = data2.read()
I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
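For reference, Solution 1's approach applied to this exact flow collapses the function above into a Session and two gets. A sketch, assuming (as the code above does) that the second request is the one that returns the sheet:

import requests

url = ('http://nrega.ap.gov.in/Nregs/FrontServlet'
       '?requestType=HouseholdInf_engRH&hhid=192420317026010002'
       '&actionVal=musterrolls&type=Normal')

s = requests.Session()
s.get(url)                   # first hit: the server sets the session cookie
stuff2 = s.get(url).content  # second hit: cookie sent back, excel returned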