Future exception was never retrieved

Not an answer to your question, but perhaps a solution to your problem, depending on whether you just want to get the code working or not.

I would validate the URLs before requesting them. I've had a lot of headaches with this kind of thing trying to harvest data, so I decided to fix them upfront and report malformed URLs to a log.

You can use Django's regex or other code to do this, as it's publicly available.

In this question someone gives the validation regex used by Django: Python - How to validate a url in python ? (Malformed or not)
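
If you'd rather not pull Django in just for its regex, a rough stdlib-only sketch along these lines does a first pass (it only catches structurally broken URLs, not hosts that fail to resolve; the helper name filter_valid_urls is just made up for the example):

    import logging
    from urllib.parse import urlsplit

    def filter_valid_urls(links):
        """Keep only http(s) URLs that have a host; log and drop the rest."""
        valid = []
        for link in links:
            parts = urlsplit(link)
            if parts.scheme in ('http', 'https') and parts.netloc:
                valid.append(link)
            else:
                logging.warning('Skipping malformed URL: %r', link)
        return valid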

Comments

  • Charnel almost 2 years

    I have a scraper (based on Python 3.4.2 and the asyncio/aiohttp libs) and a bunch of links (> 10K) from which to retrieve some small amount of data. Part of the scraper code:

    @asyncio.coroutine
    def prepare(self, links):
        semaphore = asyncio.Semaphore(self.limit_concurrent)
        tasks = []
        result = []
    
        tasks = [self.request_data(link, semaphore) for link in links]

        for task in asyncio.as_completed(tasks):
            response = yield from task
            if response:
                result.append(response)
            task.close()
        return result
    
    @asyncio.coroutine
    def request_data(self, link, semaphore):
    
        ...
    
        with (yield from semaphore):
            while True:
                counter += 1
                if counter >= self.retry:
                    break
                with aiohttp.Timeout(self.timeout):
                    try:
                        response = yield from self.session.get(url, headers=self.headers)
                        body = yield from response.read()
                        break
                    except asyncio.TimeoutError as err:
                        logging.warning('Timeout error getting {0}'.format(url))
                        return None
                    except Exception:
                        return None
        ...
    

    When it tries to make requests to malformed URLs, I get messages like this:

    Future exception was never retrieved
    future: <Future finished exception=gaierror(11004, 'getaddrinfo failed')>
    Traceback (most recent call last):
      File "H:\Python_3_4_2\lib\concurrent\futures\thread.py", line 54, in run
        result = self.fn(*self.args, **self.kwargs)
      File "H:\Python_3_4_2\lib\socket.py", line 530, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno 11004] getaddrinfo failed
    

    The error occurs when trying to yield the response from session.get. As I understand it, the exception was never consumed by asyncio and so it wasn't "bubbled up".

    First I tried to simply wrap the request in try/except:

    try:
        response = yield from self.session.get(url, headers=self.headers)
    except Exception:
        return None
    

    This doesn't work.

    Then I read here about chaining coroutines to catch the exception, but that didn't work for me either. I still get those messages and the script crashes after a certain amount of time.

    So my question is: how can I handle this exception in a proper way?
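
    A minimal sketch of one alternative shape for prepare(), using asyncio.gather() with return_exceptions=True so every exception raised by the scraper's own tasks is retrieved instead of being left sitting on a future (whether it also silences the getaddrinfo warning above depends on where aiohttp creates that future; the failure logging here is only illustrative):

    @asyncio.coroutine
    def prepare(self, links):
        semaphore = asyncio.Semaphore(self.limit_concurrent)
        tasks = [self.request_data(link, semaphore) for link in links]
        # gather() hands exceptions back as ordinary results, so none of the
        # tasks is left with an unconsumed exception.
        responses = yield from asyncio.gather(*tasks, return_exceptions=True)
        result = []
        for link, response in zip(links, responses):
            if isinstance(response, Exception):
                logging.warning('Request for {0} failed: {1!r}'.format(link, response))
            elif response:
                result.append(response)
        return result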

  • Charnel over 7 years
    Yes, I was looking in that direction, but there is another problem. A little research shows that not all links that cause this error are malformed. Some of them just have redirects or WebSocket servers instead of http(s). I think in this case it's better to be able to catch the exception.
  • user7296055 over 7 years
    Perhaps you can post a debug traceback via the methods discussed here: docs.python.org/3/library/… Maybe it will give more information on what exactly is triggering it. Either way, good luck; scrapers can be a nightmare to maintain and keep running ^^
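
    Presumably that means asyncio's debug mode (an assumption, since the link is cut off); a minimal sketch of enabling it, which makes warnings like "Future exception was never retrieved" include a traceback of where the offending task or future was created:

    import asyncio
    import logging

    # Let asyncio's own debug warnings reach the console.
    logging.basicConfig(level=logging.DEBUG)

    loop = asyncio.get_event_loop()
    loop.set_debug(True)  # same effect as running with PYTHONASYNCIODEBUG=1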