Future exception was never retrieved

Not an answer to your question, but perhaps a solution to your problem, depending on whether you just want to get the code working or not.

I would validate the URLs before requesting them. I've had a lot of headaches with this kind of thing trying to harvest data, so I decided to fix them upfront and report malformed URLs to a log.

You can use Django's regex or other code to do this, as it's publicly available.

In this question someone gives the validation regex used by Django: Python - How to validate a url in python ? (Malformed or not)
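
If you'd rather not pull Django in just for its regex, a rough stdlib-only sketch along these lines does a first pass (it only catches structurally broken URLs, not hosts that fail to resolve; the helper name filter_valid_urls is just made up for the example):

    import logging
    from urllib.parse import urlsplit

    def filter_valid_urls(links):
        """Keep only http(s) URLs that have a host; log and drop the rest."""
        valid = []
        for link in links:
            parts = urlsplit(link)
            if parts.scheme in ('http', 'https') and parts.netloc:
                valid.append(link)
            else:
                logging.warning('Skipping malformed URL: %r', link)
        return valid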

Comments

  • Charnel almost 2 years

    I have a scraper (based on Python 3.4.2 and the asyncio/aiohttp libs) and a bunch of links (> 10K) from which to retrieve some small amount of data. Part of the scraper code:

    @asyncio.coroutine
    def prepare(self, links):
        semaphore = asyncio.Semaphore(self.limit_concurrent)
        tasks = []
        result = []
    
        tasks = [self.request_data(link, semaphore) for link in links]

        for task in asyncio.as_completed(tasks):
            response = yield from task
            if response:
                result.append(response)
            task.close()
        return result
    
    @asyncio.coroutine
    def request_data(self, link, semaphore):
    
        ...
    
        with (yield from semaphore):
            while True:
                counter += 1
                if counter >= self.retry:
                    break
                with aiohttp.Timeout(self.timeout):
                    try:
                        response = yield from self.session.get(url, headers=self.headers)
                        body = yield from response.read()
                        break
                    except asyncio.TimeoutError as err:
                        logging.warning('Timeout error getting {0}'.format(url))
                        return None
                    except Exception:
                        return None
        ...
    

    When it tries to make requests to malformed URLs, I get messages like this:

    Future exception was never retrieved
    future: <Future finished exception=gaierror(11004, 'getaddrinfo failed')>
    Traceback (most recent call last):
      File "H:\Python_3_4_2\lib\concurrent\futures\thread.py", line 54, in run
        result = self.fn(*self.args, **self.kwargs)
      File "H:\Python_3_4_2\lib\socket.py", line 530, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno 11004] getaddrinfo failed
    

    The error occurs when trying to yield the response from session.get. As I understand it, the exception was never consumed by asyncio and so it wasn't "bubbled up".

    First I tried to simply wrap the request in try/except:

    try:
        response = yield from self.session.get(url, headers=self.headers)
    except Exception:
        return None
    

    This doesn't work.

    Then I read here about chaining coroutines to catch the exception, but that didn't work for me either. I still get those messages and the script crashes after a certain amount of time.

    So my question is: how can I handle this exception in a proper way?
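
    A minimal sketch of one alternative shape for prepare(), using asyncio.gather() with return_exceptions=True so every exception raised by the scraper's own tasks is retrieved instead of being left sitting on a future (whether it also silences the getaddrinfo warning above depends on where aiohttp creates that future; the failure logging here is only illustrative):

    @asyncio.coroutine
    def prepare(self, links):
        semaphore = asyncio.Semaphore(self.limit_concurrent)
        tasks = [self.request_data(link, semaphore) for link in links]
        # gather() hands exceptions back as ordinary results, so none of the
        # tasks is left with an unconsumed exception.
        responses = yield from asyncio.gather(*tasks, return_exceptions=True)
        result = []
        for link, response in zip(links, responses):
            if isinstance(response, Exception):
                logging.warning('Request for {0} failed: {1!r}'.format(link, response))
            elif response:
                result.append(response)
        return result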

  • Charnel over 7 years
    Yes, I was looking in that direction, but there is another problem. A little research shows that not all links that cause this error are malformed. Some of them just have redirects or WebSocket servers instead of http(s). I think in this case it's better to be able to catch the exception.
  • user7296055 over 7 years
    Perhaps you can post a debug traceback via the methods discussed here: docs.python.org/3/library/… Maybe it will give more information on what exactly is triggering it. Either way, good luck; scrapers can be a nightmare to maintain and keep running ^^
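
    Presumably that means asyncio's debug mode (an assumption, since the link is cut off); a minimal sketch of enabling it, which makes warnings like "Future exception was never retrieved" include a traceback of where the offending task or future was created:

    import asyncio
    import logging

    # Let asyncio's own debug warnings reach the console.
    logging.basicConfig(level=logging.DEBUG)

    loop = asyncio.get_event_loop()
    loop.set_debug(True)  # same effect as running with PYTHONASYNCIODEBUG=1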