How to handle urllib's timeout in Python 3?
Solution 1
Catch the different exceptions with explicit clauses, and inspect the reason attribute of URLError to detect a timeout (thank you, Régis B.):
import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

# `url` and `name` are assumed to be defined by the caller.
try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:  # HTTPError subclasses URLError, so it must come first
    logging.error('HTTP Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
except URLError as error:
    if isinstance(error.reason, timeout):
        logging.error('Timeout Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.error('URL Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')
NB: In response to recent comments: the original post referenced Python 3.2, where you needed to catch timeout errors explicitly with socket.timeout. For example:
# Warning - Python 3.2 code
import logging
import urllib.request
from socket import timeout

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except timeout:
    logging.error('socket timed out - URL %s', url)
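A side note for readers on newer interpreters. As far as I know, socket.timeout has been a subclass of OSError since Python 3.3 and a deprecated alias of the built-in TimeoutError since Python 3.10, so `except timeout:` and `except TimeoutError:` should behave the same there. The sketch below simply asks the running interpreter how the classes relate; `timeout_class_facts` is a hypothetical helper name chosen for illustration, not part of any library.

```python
import socket
from urllib.error import HTTPError, URLError


def timeout_class_facts():
    """Report how the timeout exception relates to urllib's error classes."""
    return {
        "socket.timeout is TimeoutError": socket.timeout is TimeoutError,
        "socket.timeout subclasses OSError": issubclass(socket.timeout, OSError),
        "URLError subclasses OSError": issubclass(URLError, OSError),
        "HTTPError subclasses URLError": issubclass(HTTPError, URLError),
        # A bare timeout is NOT a URLError, so `except URLError` alone misses it.
        "socket.timeout subclasses URLError": issubclass(socket.timeout, URLError),
    }


for fact, holds in timeout_class_facts().items():
    print(f"{fact}: {holds}")
```

Because URLError also derives from OSError, a single `except OSError` clause would catch HTTP errors, wrapped timeouts and bare timeouts alike, at the cost of no longer telling them apart.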
Solution 2
The previous answer does not correctly intercept timeout errors. Timeout errors are raised as URLError, so if we want to catch them specifically, we need to write:
import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened')
else:
    logging.info('Access successful.')
Note that ValueError can be raised independently, e.g. if the URL is invalid. Like HTTPError, it is not associated with a timeout.
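To make the distinction between these exception types concrete, here is a minimal sketch of a classifier that labels whatever a fetch raised. `classify_fetch_error` and its labels are names invented for this illustration. It also covers a bare socket.timeout, which some comments report seeing in practice.

```python
import socket
import urllib.request
from urllib.error import HTTPError, URLError


def classify_fetch_error(exc):
    """Return a short label for an exception raised while fetching a URL."""
    # HTTPError subclasses URLError, so it has to be tested first.
    if isinstance(exc, HTTPError):
        return f"http {exc.code}"
    # A timeout wrapped by urllib: URLError whose .reason is the socket timeout.
    if isinstance(exc, URLError):
        return "timeout" if isinstance(exc.reason, socket.timeout) else "url error"
    # A timeout that reached the caller without being wrapped.
    if isinstance(exc, socket.timeout):
        return "timeout"
    # Raised before any network traffic, e.g. for a URL without a scheme.
    if isinstance(exc, ValueError):
        return "invalid url"
    return "other"


# A malformed URL fails locally, so this demo needs no network access.
try:
    urllib.request.urlopen("not-a-url", timeout=10)
except Exception as error:
    print(classify_fetch_error(error))
```

The order of the isinstance checks mirrors the order of the except clauses above: most specific class first.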
Solution 3
What is a "timeout"? Holistically, I think it means "a situation where the server didn't respond in time, typically because of high load, and it's worth retrying."
HTTP status 504 "gateway timeout" would be a timeout under this definition. It's delivered via HTTPError.
HTTP status 429 "too many requests" would also be a timeout under that definition. It too is delivered via HTTPError.
Otherwise, what do we mean by a timeout? Do we include timeouts in resolving the domain name via the DNS resolver? Timeouts when trying to send data? Timeouts when waiting for the data to come back?
I don't know how to audit the source code of urllib to be sure that every condition I might consider a timeout is raised in a way that I'd catch. In a language without checked exceptions, I don't see how to. I have a hunch that connect-to-DNS errors might come back as socket.timeout, and connect-to-remote-server errors might come back as URLError(socket.timeout). It's just a guess that might explain earlier observations.
So I fell back to some really defensive coding. (1) I'm handling some HTTP status codes that are indicative of timeouts. (2) There are reports that some timeouts come via socket.timeout exceptions, and some via URLError(socket.timeout) exceptions, so I'm catching both. (3) And just in case, I threw in HTTPError(socket.timeout) as well.
import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

# The enclosing function and its signature are assumed; the original showed only the body.
def fetch(url: str, cache: str) -> bytes:
    while True:
        reason: Optional[str] = None
        try:
            with urllib.request.urlopen(url) as response:
                content = response.read()
            with open(cache, "wb") as file:
                file.write(content)
            return content
        except urllib.error.HTTPError as e:
            if e.code == 429 or e.code == 504:  # 429=too many requests, 504=gateway timeout
                reason = f'{e.code} {str(e.reason)}'
            elif isinstance(e.reason, socket.timeout):
                reason = f'HTTPError socket.timeout {e.reason} - {e}'
            else:
                raise
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout):
                reason = f'URLError socket.timeout {e.reason} - {e}'
            else:
                raise
        except socket.timeout as e:
            reason = f'socket.timeout {e}'
        # Only timeout-like failures reach this point; everything else was re-raised.
        netloc = urllib.parse.urlsplit(url).netloc  # e.g. nominatim.openstreetmap.org
        print(f'*** {netloc} {reason}; will retry', file=sys.stderr)
        time.sleep(5)
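The loop above retries forever, and it calls urlopen without a timeout argument, so a stalled connection may never raise at all. If that is not what you want, one possible variation is to cap the number of attempts and back off between them. This is only a sketch under my own assumptions: `is_retryable`, `fetch_with_retries`, and the `opener` and `sleep` parameters are hypothetical names, the last two existing so the function can be exercised without a network.

```python
import socket
import time
import urllib.error
import urllib.request


def is_retryable(exc):
    """True for the timeout-like failures discussed above."""
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in (429, 504)  # too many requests, gateway timeout
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch_with_retries(url, attempts=4, timeout=10.0, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, trying at most `attempts` times on timeout-like failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Give up on non-timeout errors and after the final attempt.
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # exponential backoff
```

A call such as `fetch_with_retries(url, attempts=3)` then either returns the body or re-raises the last exception, so the caller decides what a persistent timeout means.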
nindalf
Updated on January 28, 2021
Comments
-
nindalf over 3 years
First off, my problem is quite similar to this one. I would like a timeout of urllib.urlopen() to generate an exception that I can handle.
Doesn't this fall under URLError?
try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')
The error message:
    resp = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1156, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1141, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.2/http/client.py", line 1046, in getresponse
    response.begin()
  File "/usr/lib/python3.2/http/client.py", line 346, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.2/http/client.py", line 308, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.2/socket.py", line 276, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
There was a major change in Python 3 when they re-organised the urllib and urllib2 modules into urllib. Is it possible that there was a change then that causes this?
-
Asclepius over 5 years
I once had a socket.timeout exception despite this code. It was not caught by this code. It happened just once among many attempts. The code is correct for the most part though, in that catching URLError catches most timeout errors. This is with Python 3.7.2. In summary, to be safer, I'm also catching socket.timeout.
-
WinEunuuchs2Unix over 2 years
This answer has six question marks, two "I don't know"s and one "I think". There is little confidence one should copy and paste this code into their program. A lot of hacking to break into one's working system is probably required to test the copied and pasted code. To top it all off, there is an infinite loop with a 5-second sleep.
-
Lucian Wischik over 2 years
There are six question marks because I'm asking questions that the other answers didn't consider! I agree not to copy+paste. Instead, you should write whatever code you want, then verify whether it adequately handles the questions I raised, and if not then adjust. If you finally end up with code different from mine then you should ask yourself why. (PS. It's easy to audit that the infinite loop comes only in a finite set of cases, 504, 429 and socket.timeout, since that's what I wanted, and if you don't want that then it's clear what to change!)
-
Otheus about 2 years
This is absolutely incorrect! In Python 3.9, only the first exception is caught. Perhaps some change was introduced between 3 and 3.9?
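On the testing concern raised earlier in the comments: checking timeout handling does not have to involve breaking a working system. One option is to patch urlopen with unittest.mock so that it raises whichever exception you want to simulate. The sketch below assumes a hypothetical function under test, `fetch_or_none`, and a hypothetical helper, `simulate`; neither comes from the answers above.

```python
import socket
import unittest.mock
import urllib.request
from urllib.error import URLError


def fetch_or_none(url, timeout=10):
    """Hypothetical function under test: return the text, or None on a timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except URLError as error:
        # Timeout wrapped by urllib; any other URLError is passed on.
        if isinstance(error.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeout that was not wrapped in a URLError.
        return None


def simulate(exc):
    """Call fetch_or_none with urlopen patched to raise exc; no network is used."""
    with unittest.mock.patch("urllib.request.urlopen", side_effect=exc):
        return fetch_or_none("http://example.invalid/")


print(simulate(URLError(socket.timeout("timed out"))))
print(simulate(socket.timeout("timed out")))
```

Running the same simulation on each Python version you support is a cheap way to check claims about which except clause fires where.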