Python requests with multithreading

62,468

Install the grequests module, which combines requests with gevent (requests itself is not designed for async use):

pip install grequests

Then change the code to something like this:

import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com', 
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

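    def exception(self, request, exception):
        # Called by grequests.map for every request that raised an exception
        print("Problem: {}: {}".format(request.url, exception))

    def run_async(self):
        # Renamed from "async", which is a reserved word since Python 3.7.
        # size=5 caps the number of requests in flight at the same time.
        results = grequests.map((grequests.get(u) for u in self.urls), exception_handler=self.exception, size=5)
        print(results)

test = Test()
test.run_async()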

The requests documentation itself names grequests as one way to do this:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.

With 10 URLs, this method gave me a noticeable speedup: 0.877s versus 3.852s with your original method.
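The speedup comes from overlapping the time spent waiting on each response rather than from doing less work. Here is a minimal sketch of that effect, not the grequests code above: it swaps in a standard-library thread pool for gevent and uses `time.sleep` as a stand-in for a blocking HTTP call, so it runs without network access. The `fake_fetch` helper and the `.invalid` URLs are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a blocking HTTP GET: wait, then return a fake body."""
    time.sleep(delay)
    return "response from " + url


# Ten placeholder URLs; nothing is actually requested.
urls = ["http://site-%d.invalid/" % i for i in range(10)]

# One request after another: the waits add up.
start = time.perf_counter()
sequential = [fake_fetch(u) for u in urls]
sequential_seconds = time.perf_counter() - start

# All requests in flight at once: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent = list(pool.map(fake_fetch, urls))
concurrent_seconds = time.perf_counter() - start

print("sequential: %.3fs" % sequential_seconds)
print("concurrent: %.3fs" % concurrent_seconds)
```

The absolute numbers depend on the simulated delay and say nothing about real servers; only the gap between the two timings carries over.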

Author by

krypt

Updated on December 30, 2020

Comments

  • krypt, over 3 years ago

    I've been trying to build a scraper with multithreading for the past two days, and I still can't get it to work. At first I tried the regular approach with the threading module, but it wasn't faster than using a single thread. Later I learned that requests is blocking, so the multithreading approach wasn't really working. I kept researching and found grequests and gevent. Now I'm running tests with gevent, and it's still not faster than using a single thread. Is my code wrong?

    Here is the relevant part of my class:

    import gevent.monkey
    from gevent.pool import Pool
    import requests
    
    gevent.monkey.patch_all()
    
    class Test:
        def __init__(self):
            self.session = requests.Session()
            self.pool = Pool(20)
            self.urls = [...urls...]
    
        def fetch(self, url):
    
            try:
                response = self.session.get(url, headers=self.headers)
            except:
                self.logger.error('Problem: ', id, exc_info=True)
    
            self.doSomething(response)
    
        def async(self):
            for url in self.urls:
                self.pool.spawn( self.fetch, url )
    
            self.pool.join()
    
    test = Test()
    test.async()
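Separately from speed, the `fetch` method in the question has a control-flow gap: when `get` raises, the `except` branch only logs, and execution continues to `self.doSomething(response)` with `response` never assigned. A minimal sketch of one way to close that gap, assuming it is acceptable to skip failed URLs. The `get` parameter, `flaky_get`, and the `.invalid` URLs are hypothetical, and a standard-library thread pool stands in for the gevent pool so the example runs without extra packages.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("fetch_sketch")


def fetch(url, get):
    """Call get(url) and return its result, or None if the call raised."""
    try:
        response = get(url)
    except Exception:
        # Log the failing URL with the traceback, then stop here so that
        # nothing below touches an unassigned response.
        logger.error("Problem: %s", url, exc_info=True)
        return None
    return response


def flaky_get(url):
    """Hypothetical stand-in for session.get that fails for some URLs."""
    if "bad" in url:
        raise ConnectionError("simulated failure for " + url)
    return "body of " + url


urls = ["http://ok.invalid/1", "http://bad.invalid/2", "http://ok.invalid/3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda u: fetch(u, flaky_get), urls))

print(results)
```

Catching `Exception` instead of using a bare `except:` also lets `KeyboardInterrupt` and `SystemExit` propagate.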