How to catch 404 error in urllib.urlretrieve

Solution 1

Check out urllib.urlretrieve's complete code:

def urlretrieve(url, filename=None, reporthook=None, data=None):
    global _urlopener
    if not _urlopener:
        _urlopener = FancyURLopener()
    return _urlopener.retrieve(url, filename, reporthook, data)

In other words, you can use urllib.FancyURLopener (it's part of the public urllib API). You can override http_error_default to detect 404s:

import urllib

class MyURLopener(urllib.FancyURLopener):
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        # Handle errors the way you'd like to; for example, raise so the caller sees the 404:
        raise IOError('HTTP error %d: %s' % (errcode, errmsg))

fn, h = MyURLopener().retrieve(url, reporthook=my_report_hook)
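
For example, a minimal sketch of wiring this up with a progress hook and catching the failure, assuming http_error_default raises IOError as above; the URL and the show_progress callback are placeholders:

def show_progress(block_count, block_size, total_size):
    # Placeholder reporthook: print bytes downloaded so far.
    print '%d / %d bytes' % (block_count * block_size, total_size)

try:
    fn, h = MyURLopener().retrieve(
        'http://www.google.com/this-gives-a-404/', reporthook=show_progress)
except IOError, e:
    print 'Download failed:', e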

Solution 2

You should use:

import urllib2

try:
    resp = urllib2.urlopen("http://www.google.com/this-gives-a-404/")
except urllib2.URLError, e:
    if not hasattr(e, "code"):
        raise
    resp = e

print "Gave", resp.code, resp.msg
print "=" * 80
print resp.read(80)

Edit: The rationale here is that unless you expect the exceptional state and handle it, it is an exception for it to happen, and you probably did not even think about it. So instead of letting your code keep running after an unsuccessful request, the default behavior is, quite sensibly, to stop it by raising.
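
If you also want the progress-bar behaviour that reporthook gives you, one option is to drive the download yourself: read the urllib2 response in chunks and report as you go. A minimal sketch, where the chunk size and the show_progress callback are arbitrary placeholders:

import urllib2

def show_progress(bytes_read, total_size):
    # Placeholder progress callback; total_size is -1 when unknown.
    print '%d / %d bytes' % (bytes_read, total_size)

def retrieve(url, filename, chunk_size=8192):
    try:
        resp = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        # HTTPError is the URLError subclass that carries the HTTP status code.
        print 'Gave', e.code, e.msg
        raise
    total_size = int(resp.info().getheader('Content-Length', '-1'))
    bytes_read = 0
    with open(filename, 'wb') as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            bytes_read += len(chunk)
            show_progress(bytes_read, total_size)
    return filename, resp.info()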

Solution 3

The URLopener object's retrieve method supports the reporthook and raises an exception on a 404 (unlike FancyURLopener, which urllib.urlretrieve uses and which swallows the error).

http://docs.python.org/library/urllib.html#url-opener-objects
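
Concretely, that means instantiating the plain urllib.URLopener yourself instead of calling urllib.urlretrieve (which goes through FancyURLopener). A minimal sketch; the URL, output filename and show_progress hook below are placeholders:

import urllib

def show_progress(block_count, block_size, total_size):
    # Placeholder reporthook, called after each block is read.
    print '%d / %d bytes' % (block_count * block_size, total_size)

opener = urllib.URLopener()
try:
    fn, headers = opener.retrieve('http://www.google.com/this-gives-a-404/',
                                  'out.html', reporthook=show_progress)
except IOError, e:
    # The plain URLopener raises IOError('http error', code, message, headers).
    print 'Download failed:', e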

Author by Sridhar Ratnakumar

Updated on July 13, 2022

Comments

  • Sridhar Ratnakumar, almost 2 years

    Background: I am using urllib.urlretrieve, as opposed to any other function in the urllib* modules, because of its hook-function support (see reporthook below; a sketch of such a hook follows this question), which is used to display a textual progress bar. This is Python >= 2.6.

    >>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])
    

    However, urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g. was it a 404 or a 200?).

    >>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
    >>> h.items() 
    [('date', 'Thu, 20 Aug 2009 20:07:40 GMT'),
     ('expires', '-1'),
     ('content-type', 'text/html; charset=ISO-8859-1'),
     ('server', 'gws'),
     ('cache-control', 'private, max-age=0')]
    >>> h.status
    ''
    >>>
    

    What is the best known way to download a remote HTTP file with hook-like support (to show progress bar) and a decent HTTP error handling?
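
    For reference, a minimal sketch of the kind of textual progress hook meant here; the percentage formatting and the URL are illustrative choices only:

    import sys
    import urllib

    def report(block_count, block_size, total_size):
        # Illustrative reporthook: urlretrieve calls this after each block is read.
        downloaded = block_count * block_size
        if total_size > 0:
            sys.stdout.write('\r%3d%%' % min(100, 100 * downloaded // total_size))
        else:
            sys.stdout.write('\r%d bytes' % downloaded)
        sys.stdout.flush()

    fn, h = urllib.urlretrieve('http://google.com/foo/bar', reporthook=report)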