urllib2.HTTPError: HTTP Error 403: Forbidden


Solution 1

By adding a few more headers I was able to get the data:

import urllib2

site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

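req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Print the error body returned by the server
    print e.fp.read()
else:
    # Read the response only if the request succeeded
    content = page.read()
    print content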

Actually, it works with just this one additional header:

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
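For Python 3 readers, the same idea can be sketched with urllib.request, the successor to urllib2. This is a minimal sketch, not the answer's original code: the URL is a hypothetical placeholder and `build_request` is a name made up for illustration. It only builds the request and lists the headers set on it, so it runs without network access.

```python
import urllib.request

# Hypothetical placeholder URL; substitute the page you are fetching.
URL = "http://example.com/data.csv"


def build_request(url):
    """Build a request carrying a browser-like User-Agent plus an Accept header."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    return urllib.request.Request(url, headers=headers)


request = build_request(URL)

# header_items() lists the headers set on the request so far. urllib adds
# others (Host, Connection, ...) only when the request is actually sent.
for name, value in request.header_items():
    print(name, "=", value)
```

Note that urllib stores header names in capitalized form, so the User-Agent header is looked up as "User-agent" with `request.get_header`.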

Solution 2

In Python 3, urllib2 became urllib.request. Set a browser-like User-Agent on the request:

import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)  # The assembled request
response = urllib.request.urlopen(request)
data = response.read()  # The response body, as bytes
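If the server still answers with a 403, it can help to catch the error and look at the status code and body instead of letting the exception propagate. The following is a sketch under assumptions: `fetch` and `always_forbidden` are hypothetical names, and `always_forbidden` is a stand-in for `urlopen` that simulates a 403 so the example runs offline.

```python
import io
import urllib.error
import urllib.request


def fetch(url, headers, opener=urllib.request.urlopen):
    """Return (status, body). On an HTTP error, return the error status and body."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and can be read like a response.
        return err.code, err.read()


def always_forbidden(request):
    """Hypothetical stand-in for urlopen that simulates a 403 without network access."""
    raise urllib.error.HTTPError(
        request.full_url, 403, "Forbidden", None, io.BytesIO(b"blocked")
    )


status, body = fetch(
    "http://example.com/", {"User-Agent": "Mozilla/5.0"}, opener=always_forbidden
)
print(status, body)
```

In real use, call `fetch(url, headers)` without the `opener` argument so that `urllib.request.urlopen` performs the request.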

Solution 3

The NSE website has changed, and older scripts no longer work well against the current site. This snippet gathers the daily details of a security: symbol, security type, previous close, open price, high price, low price, average price, traded quantity, turnover, number of trades, deliverable quantity, and the ratio of delivered to traded quantity as a percentage. Each day's values are printed as a dictionary.

Python 3.X version with requests and BeautifulSoup

from requests import get
from csv import DictReader
from bs4 import BeautifulSoup as Soup
from datetime import date
from io import StringIO 

SECURITY_NAME = "3MINDIA"  # Change this to get quotes for another stock
START_DATE = date(2017, 1, 1)  # Start date, given as date(year, month, day)
END_DATE = date(2017, 9, 14)  # End date, given as date(year, month, day)


BASE_URL = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol={security}&segmentLink=3&symbolCount=1&series=ALL&dateRange=+&fromDate={start_date}&toDate={end_date}&dataType=PRICEVOLUMEDELIVERABLE"




def getquote(symbol, start, end):
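    # Format as D-M-YYYY without zero padding. The "%-d" strftime code used
    # previously is platform-specific and fails on Windows.
    start = "{0.day}-{0.month}-{0.year}".format(start)
    end = "{0.day}-{0.month}-{0.year}".format(end)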

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
         'Referer': 'https://cssspritegenerator.com',
         'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
         'Accept-Encoding': 'none',
         'Accept-Language': 'en-US,en;q=0.8',
         'Connection': 'keep-alive'}

    url = BASE_URL.format(security=symbol, start_date=start, end_date=end)
    d = get(url, headers=hdr)
    soup = Soup(d.content, 'html.parser')
    payload = soup.find('div', {'id': 'csvContentDiv'}).text.replace(':', '\n')
    csv = DictReader(StringIO(payload))
    for row in csv:
        print({k:v.strip() for k, v in row.items()})


if __name__ == '__main__':
    getquote(SECURITY_NAME, START_DATE, END_DATE)

The snippet is also relatively modular and ready to use.
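The parsing step in getquote can be tried without hitting the site. The sketch below assumes, as the snippet does, that the text of the csvContentDiv element separates rows with ':' and fields with ','. The sample string and the `parse_payload` name are made up for illustration and are not real NSE output.

```python
from csv import DictReader
from io import StringIO


def parse_payload(payload):
    """Turn ':'-separated CSV rows into a list of dictionaries keyed by the header row."""
    text = payload.replace(":", "\n")
    return [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in DictReader(StringIO(text))
    ]


# Hypothetical sample in the assumed layout: a header row, then one data row.
sample = '"Symbol","Series","Close Price":"3MINDIA","EQ"," 11500.00":'

rows = parse_payload(sample)
print(rows)
```

One caveat of this approach: replacing every ':' would also split any field that itself contains a colon, such as a time of day.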

Solution 4

This error usually occurs when the server cannot tell that the request comes from a regular browser; it rejects such requests to keep out unwanted visitors such as bots. You can often get around it by defining browser-like headers and passing them along with the urllib.request call.

Here's the code:

import urllib.request

# Define browser-like headers
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

# The URL you are requesting (placeholder; replace with your own)
your_url = "http://example.com/"

req = urllib.request.Request(url=your_url, headers=header)
page = urllib.request.urlopen(req).read()
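When many requests need the same headers, one option is to set them once on an opener rather than on every Request. This is a sketch of that alternative, not part of the answer above; `make_opener` is a hypothetical helper name, and no request is sent here.

```python
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Create an opener whose default headers replace the Python-urllib User-Agent."""
    opener = urllib.request.build_opener()
    # addheaders starts out as [("User-agent", "Python-urllib/x.y")];
    # assigning a new list replaces that default.
    opener.addheaders = [
        ("User-Agent", user_agent),
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ]
    return opener


opener = make_opener()
print(opener.addheaders)
```

Calling `opener.open(your_url).read()` would then send these headers with each request, and `urllib.request.install_opener(opener)` makes plain `urlopen()` calls use them as well.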

Solution 5

One thing worth trying is simply updating your Python version. One of my crawling scripts stopped working with a 403 on Windows 10 a few months back. No user agent helped, and I was about to give up on the script. Today I tried the same script on Ubuntu with Python 3.8.5 (64-bit), and it worked with no error. The Python version on Windows was somewhat old: 3.6.2 (32-bit). After upgrading Python on Windows 10 to 3.9.5 (64-bit), I no longer see the 403.

If you give it a try, don't forget to run 'pip freeze > requirements.txt' first to export your package list. I forgot, of course. This post is also a reminder for me for when the 403 comes back.

Author by

kumar

Have over 13 years of experience in the IT industry, completing projects in domains such as Health care, Automotive, Banking and Telecom. Have been developing highly scalable, distributed & large-scale web apps/APIs with Python/Django running on AWS/On-Premise infrastructure. Have migrated large scale .Net apps to Python, Python 2 apps to Python3, modernized front end with ReactJS, containerized micro services with docker/vagrant. Have designed, lead and developed web apps/APIs using micro service architecture utilizing Jenkins/Octopus for reliable and easy deployments. Have been working on creating an AI model for algorithmic trading using Neural Networks and NLP.

Updated on May 14, 2021

Comments

  • kumar
    kumar about 3 years

    I am trying to automate the download of historical stock data using Python. The URL I am trying to open responds with a CSV file, but I am unable to open it using urllib2. I have tried changing the user agent as suggested in a few earlier questions, and I even tried accepting response cookies, with no luck. Can you please help?

    Note: The same method works for yahoo Finance.

    Code:

    import urllib2,cookielib
    
    site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
    
    hdr = {'User-Agent':'Mozilla/5.0'}
    
    req = urllib2.Request(site,headers=hdr)
    
    page = urllib2.urlopen(req)
    

    Error

    File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden

    Thanks for your assistance

  • Admin
    Admin over 11 years
    Which of these headers do you think was missing from the original request?
  • andrean
    andrean over 11 years
    wireshark showed that only the User-Agent was sent, along with Connection: close, Host: www.nseindia.com, Accept-Encoding: identity
  • kumar
    kumar over 11 years
    Andrean, thank you very much, it solved the issue, unfortunate and funny that I tried all headers except 'Accept' before posting here.
  • andrean
    andrean over 11 years
    You're welcome, well what I really did is I checked the url from your script in a browser, and as it worked there, I just copied all the request headers the browser sent, and added them here, and that was the solution.
  • araisbec
    araisbec about 11 years
    Thank you!! All of my requests were getting blocked from various forums, and this solved my problem. I think this should definitely be posted along with setting the User-Agent as a solution to the 403 error; This happened to me on numerous sites (I think most of them were running myBB).
  • sasdev
    sasdev almost 11 years
    It's true that some sites (including Wikipedia) block on common non-browser user agents strings, like the "Python-urllib/x.y" sent by Python's libraries. Even a plain "Mozilla" or "Opera" is usually enough to bypass that. This doesn't apply to the original question, of course, but it's still useful to know.
  • UserYmY
    UserYmY over 9 years
    @andrean How can I do this in Python 3 with urllib?
  • andrean
    andrean over 9 years
    @Mee did you take a look at the answer below? it was addressed specifically for python 3, check if it works for you...
  • UserYmY
    UserYmY over 9 years
    @andrean I still get this error when I use the solution below. I am trying to get the Google PageRank. raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 403: Forbidden
  • andrean
    andrean over 9 years
    try adding the other headers (from my answer) as well to the request. still there are many other reasons why a server might return a 403, check out the other answers on the topic as well. as for the target, google especially is a tough one, kinda hard to scrape, they have implemented many methods to prevent scraping.
  • Prabu
    Prabu over 7 years
    i was trying to download different url, for that it worked after removing Connection: Keep Alive. url : nseindia.com/content/historical/EQUITIES/2017/FEB/…
  • Nitish Kumar Pal
    Nitish Kumar Pal over 6 years
    Thanks, man! this worked for me instead of above answer from @andrean
  • Francesco
    Francesco about 6 years
    Hi, I really don't know where to bang my head anymore, I've tried this solution and many more but I keep getting error 403. Is there anything else I can try?
  • Supreet Sethi
    Supreet Sethi about 6 years
    A 403 status means the server understood the request but refuses to authorize it. In your case, the service may genuinely require authentication (basic auth, OAuth, etc.).
  • shaosh
    shaosh over 4 years
    I just need the user-agent to replace my previous old one.
  • neel
    neel over 4 years
    The code is working on the local but not working on the EC2 instance. Can you help me here?