urllib2.HTTPError: HTTP Error 403: Forbidden
Solution 1
By adding a few more headers I was able to get the data:
import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
# Headers copied from a real browser request
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Show the error page the server sent back
    print e.fp.read()
else:
    # Only read the body when the request succeeded
    content = page.read()
    print content
Actually, it works with just this one additional header:
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
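Narrowing a full set of browser headers down to the one that matters, as described here, can also be done mechanically. The following is only a sketch in Python 3: `minimal_headers` and `still_works` are hypothetical names, and `still_works` stands for whatever check you use (for example, a function that sends the request with the given headers and returns True when no 403 comes back).

```python
def minimal_headers(headers, still_works):
    """Greedily drop headers that are not needed for the request to succeed.

    headers: dict of header name -> value that is known to work as a whole.
    still_works: hypothetical callable that takes a headers dict and returns
                 True if the server still answers successfully with it.
    """
    needed = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in needed.items() if k != name}
        if still_works(trial):
            needed = trial  # the request succeeds without it, so drop it
    return needed
```

Each call to `still_works` would send a real request, so this is best run once against a small header set. The greedy pass finds a set from which no single header can be removed; it is not guaranteed to be the smallest possible set if headers interact.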
Solution 2
This works in Python 3:
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # The assembled request
response = urllib.request.urlopen(request)
data = response.read()  # The data you need (bytes)
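If the User-Agent alone is not enough, the Accept header from Solution 1 can be sent the same way in Python 3. This is a minimal sketch, not a guaranteed fix: `build_request` and `fetch` are hypothetical helper names, and the header values are simply the ones quoted in Solution 1.

```python
import urllib.error
import urllib.request

# Browser-like headers; the values are copied from Solution 1.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                  '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}


def build_request(url, extra_headers=None):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Return (status, body_bytes); on an HTTP error, return its status and body."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # HTTPError is itself a response object, so the error page can be read.
        return e.code, e.read()
```

Reading the body of the error response is often worthwhile, because some servers explain in it why the request was refused.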
Solution 3
The NSE website has changed, and the older scripts no longer work well with the current site. This snippet gathers the daily details of a security: symbol, security type, previous close, open price, high price, low price, average price, traded quantity, turnover, number of trades, deliverable quantity, and the percentage of delivered versus traded quantity. Each row is conveniently presented as a dictionary.
Python 3.X version with requests and BeautifulSoup
from requests import get
from csv import DictReader
from bs4 import BeautifulSoup as Soup
from datetime import date
from io import StringIO

SECURITY_NAME = "3MINDIA"  # Change this to get quotes for another stock
START_DATE = date(2017, 1, 1)  # Start date of stock quote data (year, month, day)
END_DATE = date(2017, 9, 14)  # End date of stock quote data (year, month, day)
BASE_URL = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol={security}&segmentLink=3&symbolCount=1&series=ALL&dateRange=+&fromDate={start_date}&toDate={end_date}&dataType=PRICEVOLUMEDELIVERABLE"

def format_date(d):
    # Day-month-year without zero padding, e.g. 1-1-2017.
    # The "%-d" strftime directive is not available on Windows, so build the string directly.
    return "{}-{}-{}".format(d.day, d.month, d.year)

def getquote(symbol, start, end):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Referer': 'https://cssspritegenerator.com',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    url = BASE_URL.format(security=symbol, start_date=format_date(start), end_date=format_date(end))
    d = get(url, headers=hdr)
    soup = Soup(d.content, 'html.parser')
    # Replace ':' with newlines so that each record becomes one CSV line
    payload = soup.find('div', {'id': 'csvContentDiv'}).text.replace(':', '\n')
    csv = DictReader(StringIO(payload))
    for row in csv:
        print({k: v.strip() for k, v in row.items()})

if __name__ == '__main__':
    getquote(SECURITY_NAME, START_DATE, END_DATE)
This is also a relatively modular, ready-to-use snippet.
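To collect the rows into an actual list of dictionaries instead of printing them, the parsing step can be pulled out into its own function. This is a sketch only: `parse_payload` is a hypothetical helper, and the sample string uses made-up column names and values that merely imitate the ':'-separated layout the snippet above expects.

```python
from csv import DictReader
from io import StringIO


def parse_payload(payload, row_sep=':'):
    """Turn ':'-separated CSV text into a list of dicts with stripped keys and values."""
    reader = DictReader(StringIO(payload.replace(row_sep, '\n')))
    return [
        # Short rows yield None values and long rows a None key; guard against both.
        {key.strip(): (value or '').strip()
         for key, value in row.items() if key is not None}
        for row in reader
    ]


# Made-up sample data, not real NSE output.
sample = 'Symbol,Date,Close Price:3MINDIA,02-Jan-2017,11000.00:'
rows = parse_payload(sample)
print(rows)
```

Returning the list keeps the network code and the parsing code separate, so the parser can be checked offline against a saved copy of the page.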
Solution 4
This error usually occurs when the server cannot identify the client making the request, for example because the request does not carry browser-like headers. Servers do this to block unwanted visits. You can get around the error by defining headers and passing them along with the urllib.request call.
Here's the code:
import urllib.request
# Define the headers
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}
# The URL you are requesting (placeholder; replace with your own)
your_url = "http://www.example.com/"
req = urllib.request.Request(url=your_url, headers=header)
page = urllib.request.urlopen(req).read()
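To see why a plain urllib request is easy for a server to single out, it helps to look at what urllib sends when no headers are set. The short sketch below inspects the default header of an opener and shows one way to change it for every request made through that opener; the replacement value 'Mozilla/5.0' is only an illustration.

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers added to any request that does not set them itself.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # the User-agent value has the form Python-urllib/x.y

# Replacing the list changes the default for all requests made via opener.open().
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
```

Setting the header on the opener is convenient when many requests are made, while passing `headers=` to `Request` (as above) affects only that one request.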
Solution 5
One thing worth trying is simply updating your Python version. One of my crawling scripts stopped working with a 403 on Windows 10 a few months back. No user agent helped, and I was about to give up on the script. Today I tried the same script on Ubuntu with Python 3.8.5 (64-bit) and it worked with no error. The Python version on Windows was a bit old: 3.6.2 (32-bit). After upgrading Python on Windows 10 to 3.9.5 (64-bit), I no longer see the 403. If you give it a try, don't forget to run 'pip freeze > requirements.txt' to export your package entries. I forgot, of course. This post is also a reminder for me for when the 403 comes back in the future.
Updated on May 14, 2021
Comments
-
kumar about 3 years
I am trying to automate the download of historic stock data using Python. The URL I am trying to open responds with a CSV file, but I am unable to open it using urllib2. I have tried changing the user agent as suggested in a few earlier questions, and I even tried accepting response cookies, with no luck. Can you please help?
Note: The same method works for Yahoo Finance.
Code:
import urllib2,cookielib
site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent':'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
Error
File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Thanks for your assistance
-
Admin over 11 years
Which of these headers do you think was missing from the original request?
-
andrean over 11 years
Wireshark showed that only the User-Agent was sent, along with Connection: close, Host: www.nseindia.com, Accept-Encoding: identity
-
kumar over 11 years
Andrean, thank you very much, it solved the issue. It is unfortunate and funny that I tried all headers except 'Accept' before posting here.
-
andrean over 11 years
You're welcome. What I really did was check the URL from your script in a browser, and as it worked there, I just copied all the request headers the browser sent and added them here, and that was the solution.
-
araisbec about 11 years
Thank you!! All of my requests to various forums were getting blocked, and this solved my problem. I think this should definitely be posted along with setting the User-Agent as a solution to the 403 error. This happened to me on numerous sites (I think most of them were running myBB).
-
sasdev almost 11 years
It's true that some sites (including Wikipedia) block common non-browser user agent strings, like the "Python-urllib/x.y" sent by Python's libraries. Even a plain "Mozilla" or "Opera" is usually enough to bypass that. This doesn't apply to the original question, of course, but it's still useful to know.
-
UserYmY over 9 years
@andrean How can I do this in Python 3 with urllib?
-
andrean over 9 years
@Mee Did you take a look at the answer below? It was addressed specifically for Python 3; check if it works for you...
-
UserYmY over 9 years
@andrean I still get this error when I use the solution below. I am trying to get the Google PageRank.
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
-
andrean over 9 years
Try adding the other headers (from my answer) to the request as well. Still, there are many other reasons why a server might return a 403; check out the other answers on the topic too. As for the target, Google especially is a tough one and kind of hard to scrape; they have implemented many methods to prevent scraping.
-
Prabu over 7 years
I was trying to download a different URL; for that one it worked after removing the Connection: keep-alive header. URL: nseindia.com/content/historical/EQUITIES/2017/FEB/…
-
Nitish Kumar Pal over 6 years
Thanks, man! This worked for me instead of the above answer from @andrean.
-
Francesco about 6 years
Hi, I really don't know where to bang my head anymore. I've tried this solution and many more, but I keep getting error 403. Is there anything else I can try?
-
Supreet Sethi about 6 years
A 403 status is meant to inform you that your client is not authorized to use this service. It may be that in your case it genuinely requires authentication with basic auth, OAuth, etc.
-
shaosh over 4 years
I just needed to replace my old User-Agent with a newer one.
-
neel over 4 years
The code works locally but not on the EC2 instance. Can you help me here?