How to handle urllib's timeout in Python 3?


Solution 1

Catch the different exceptions with explicit clauses, and check the reason of the URLError to see whether it was a timeout (thank you Régis B.):

import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

# `url` and `name` are assumed to be defined by the caller
try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('HTTP Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
except URLError as error:
    if isinstance(error.reason, timeout):
        logging.error('Timeout Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.error('URL Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

NB: As noted in the comments, the original question referenced Python 3.2, where timeout errors had to be caught explicitly as socket.timeout. For example:

    # Warning - python 3.2 code
    from socket import timeout
    
    try:
        response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
    except timeout:
        logging.error('socket timed out - URL %s', url)

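NB also: on Python 3.10 and newer, socket.timeout is an alias of the built-in TimeoutError, so the two styles converge and catching TimeoutError works too. A quick check:

```python
import socket
import sys

# Since Python 3.10, socket.timeout is an alias of the built-in TimeoutError,
# so an `except TimeoutError:` clause also catches socket timeouts.
print(socket.timeout is TimeoutError)  # True on Python 3.10+
```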
Solution 2

The previous answer does not correctly intercept timeout errors. Timeout errors are raised as URLError, so if we want to specifically catch them, we need to write:

import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened: %s', error.reason)
else:
    logging.info('Access successful.')

Note that ValueError can also be raised independently, e.g. if the URL is invalid. Like HTTPError, it is not associated with a timeout.
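A quick offline sketch of how these exception types relate (no network needed: the URLError is constructed by hand, and 'not-a-url' is just a made-up invalid URL string):

```python
import socket
import urllib.request
from urllib.error import URLError

# A timeout typically surfaces as a URLError whose .reason is a socket.timeout
err = URLError(socket.timeout('timed out'))
print(isinstance(err.reason, socket.timeout))  # True

# An invalid URL raises ValueError before any request is even attempted
try:
    urllib.request.urlopen('not-a-url')
except ValueError as e:
    print('ValueError:', e)
```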

Solution 3

What is a "timeout"? Holistically I think it means "a situation where the server didn't respond in time, typically because of high load, and it's worth retrying again."

HTTP status 504 "gateway timeout" would be a timeout under this definition. It's delivered via HTTPError.

HTTP status 429 "too many requests" would also be a timeout under that definition. It too is delivered via HTTPError.

Otherwise, what do we mean by a timeout? Do we include timeouts in resolving the domain name via the DNS resolver? timeouts when trying to send data? timeouts when waiting for the data to come back?

I don't know how to audit the source code of urllib to be sure that every possible way I might consider a timeout is raised in a way I'd catch. In a language without checked exceptions, I don't know how. I have a hunch that maybe connect-to-DNS errors come back as socket.timeout, and connect-to-remote-server errors come back as URLError(socket.timeout)? It's just a guess that might explain earlier observations.

So I fell back to some really defensive coding. (1) I'm handling some HTTP status codes that are indicative of timeouts. (2) There are reports that some timeouts come via socket.timeout exceptions, and some via URLError(socket.timeout) exceptions, so I'm catching both. (3) And just in case, I threw in HTTPError(socket.timeout) as well.

# Snippet from inside a function (hence the `return`); `url` and `cache`
# (a file path for caching the response body) are defined by the caller.
import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

while True:
    reason: Optional[str] = None
    try:
        with urllib.request.urlopen(url) as response:
            content = response.read()
            with open(cache, "wb") as file:
                file.write(content)
            return content
    except urllib.error.HTTPError as e:
        if e.code in (429, 504):  # 429 = too many requests, 504 = gateway timeout
            reason = f'{e.code} {e.reason}'
        elif isinstance(e.reason, socket.timeout):
            reason = f'HTTPError socket.timeout {e.reason} - {e}'
        else:
            raise
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            reason = f'URLError socket.timeout {e.reason} - {e}'
        else:
            raise
    except socket.timeout as e:
        reason = f'socket.timeout {e}'
    netloc = urllib.parse.urlsplit(url).netloc  # e.g. nominatim.openstreetmap.org
    print(f'*** {netloc} {reason}; will retry', file=sys.stderr)
    time.sleep(5)
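If the unbounded retry loop bothers you, the same idea can be bounded with exponential backoff. This is a sketch, not the author's code: `fetch_with_retries` is a made-up helper name, and the retry conditions simply mirror the cases handled above.

```python
import socket
import time
import urllib.error
import urllib.request

def fetch_with_retries(url: str, attempts: int = 4, base_delay: float = 1.0) -> bytes:
    """Retry only on timeout-like failures, with exponential backoff and a hard cap."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            if e.code not in (429, 504):  # only retry "timeout-like" statuses
                raise
        except urllib.error.URLError as e:
            if not isinstance(e.reason, socket.timeout):
                raise
        except socket.timeout:
            pass
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise TimeoutError(f'{url}: still failing after {attempts} attempts')
```

Giving up raises TimeoutError, so callers can still distinguish "exhausted retries" from the non-timeout errors that are re-raised immediately.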
Author: nindalf (hobby programmer who enjoys playing with C++ and Python)

Updated on January 28, 2021

Comments

  • nindalf
    nindalf over 3 years

    First off, my problem is quite similar to this one. I would like a timeout of urllib.urlopen() to generate an exception that I can handle.

    Doesn't this fall under URLError?

    try:
        response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
    except (HTTPError, URLError) as error:
        logging.error(
            'Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.info('Access successful.')
    

    The error message:

    resp = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
      File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.2/urllib/request.py", line 369, in open
        response = self._open(req, data)
      File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
        '_open', req)
      File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.2/urllib/request.py", line 1156, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "/usr/lib/python3.2/urllib/request.py", line 1141, in do_open
        r = h.getresponse()
      File "/usr/lib/python3.2/http/client.py", line 1046, in getresponse
        response.begin()
      File "/usr/lib/python3.2/http/client.py", line 346, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python3.2/http/client.py", line 308, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
      File "/usr/lib/python3.2/socket.py", line 276, in readinto
        return self._sock.recv_into(b)
    socket.timeout: timed out

    There was a major change in Python 3 when the urllib and urllib2 modules were reorganised into urllib. Is it possible that a change made then causes this?

  • Asclepius
    Asclepius over 5 years
    I once had a socket.timeout exception despite this code. It was not caught by this code. It happened just once among many attempts. The code is correct for the most part though in that catching URLError catches most timeout errors. This is with Python 3.7.2. In summary, to be safer, I'm also catching socket.timeout.
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 2 years
    This answer has six question marks, two "I don't know"'s and one "I think". There is little confidence one should copy and paste this code into their program. A lot of hacking to break into one's working system is probably required to test the copy and pasted code. To top it all off there is an infinite loop with 5 second sleep.
  • Lucian Wischik
    Lucian Wischik over 2 years
    There are six question marks because I'm asking questions that the other answers didn't consider! I agree not to copy+paste. Instead, you should write whatever code you want, then verify whether it adequately handles the questions I raised, and if not then adjust. If you finally end up with code different from mine then you should ask yourself why. (PS. it's easy to audit that the infinite loop comes only in finite set of cases 504, 429, socket.timeout, since that's what I wanted, and if you don't want that then it's clear what to change!)
  • Otheus
    Otheus about 2 years
    This is absolutely incorrect! In Python 3.9, only the first exception is caught. Perhaps some change was introduced between 3 and 3.9?