Scrapy using pool of random proxies to avoid being banned

Solution 1

There is already a library for this: https://github.com/aivarsk/scrapy-proxies

Please download it from there. It is not on pypi.org yet, so you can't install it easily using pip or easy_install.
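For reference, a minimal settings sketch along the lines of that project's README; the setting names (PROXY_LIST, RETRY_HTTP_CODES) and the middleware path scrapy_proxies.RandomProxy are taken from the README at the time of writing, so verify them against the repository you download:

    # settings.py -- sketch of a scrapy-proxies setup; verify names against the repo.

    # Retry aggressively, since free proxies fail often.
    RETRY_TIMES = 10
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }

    # Plain-text file with one proxy per line, e.g. http://host:port
    PROXY_LIST = '/path/to/proxy/list.txt'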

Solution 2

  1. There is no single correct answer for this. Some proxies are not always available, so you have to check them now and then. Also, if you use the same proxy every time, the server you are scraping may block its IP as well, but that depends on the security mechanisms that server has.
  2. Yes, because you don't know whether all the proxies in your pool support HTTPS. Or you could keep just one pool and add a field to each proxy that indicates its HTTPS support (see the sketch after this list).
  3. In your settings you are disabling the user agent middleware: 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None. With it disabled, the USER_AGENT setting won't have any effect.
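
To illustrate points 1 and 2, here is a rough sketch (not part of the original answer) of a single pool whose entries carry an https flag: the middleware only hands HTTPS requests to proxies marked as supporting HTTPS, and drops a proxy from the pool once it causes a download error. The entry format, the 'https' field and the class name are assumptions for illustration only.

    # middlewares.py -- illustrative sketch; the 'https' field and the
    # removal-on-failure policy are assumptions, not a definitive implementation.
    import base64
    import random

    PROXIES = [
        {'ip_port': '168.63.249.35:80', 'user_pass': '', 'https': True},
        {'ip_port': '162.17.98.242:8888', 'user_pass': '', 'https': False},
    ]

    class SchemeAwareProxyMiddleware(object):
        def process_request(self, request, spider):
            # Only consider proxies that can handle the request's scheme.
            wants_https = request.url.startswith('https://')
            candidates = [p for p in PROXIES if p['https'] or not wants_https]
            if not candidates:
                return  # fall back to a direct request
            proxy = random.choice(candidates)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                creds = base64.b64encode(proxy['user_pass'].encode()).decode()
                request.headers['Proxy-Authorization'] = 'Basic ' + creds

        def process_exception(self, request, exception, spider):
            # Crude "check them now and then": drop a proxy once it fails.
            bad = request.meta.get('proxy', '').replace('http://', '')
            for p in list(PROXIES):
                if p['ip_port'] == bad:
                    PROXIES.remove(p)
                    spider.log("Removed dead proxy %s" % bad)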

Author: Inês Martins

Junior data scientist. Pylady.

Updated on September 18, 2022

Comments

  • Inês Martins, over 1 year ago:

    I am quite new to scrapy (and my background is not informatics). There is a website that I can't visit with my local IP, since I am banned, but I can visit it using a VPN service in the browser. For my spider to be able to crawl it, I set up a pool of proxies that I found here: http://proxylist.hidemyass.com/. With that, my spider is able to crawl and scrape items, but my doubt is whether I have to change the proxy pool list every day. Sorry if my question is a dumb one...

    Here is my settings.py:

    BOT_NAME = 'reviews'
    
    SPIDER_MODULES = ['reviews.spiders']
    NEWSPIDER_MODULE = 'reviews.spiders'
    DOWNLOAD_DELAY = 1
    RANDOMIZE_DOWNLOAD_DELAY = True
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,  # disabled to avoid "IOError: Not a gzipped file" exceptions
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'reviews.rotate_useragent.RotateUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
        'reviews.middlewares.ProxyMiddleware': 100,
    }
    
    PROXIES = [{'ip_port': '168.63.249.35:80', 'user_pass': ''},
               {'ip_port': '162.17.98.242:8888', 'user_pass': ''},
               {'ip_port': '70.168.108.216:80', 'user_pass': ''},
               {'ip_port': '45.64.136.154:8080', 'user_pass': ''},
               {'ip_port': '149.5.36.153:8080', 'user_pass': ''},
               {'ip_port': '185.12.7.74:8080', 'user_pass': ''},
               {'ip_port': '150.129.130.180:8080', 'user_pass': ''},
               {'ip_port': '185.22.9.145:8080', 'user_pass': ''},
               {'ip_port': '200.20.168.135:80', 'user_pass': ''},
               {'ip_port': '177.55.64.38:8080', 'user_pass': ''},]
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'reviews (+http://www.yourdomain.com)'
    

    Here is my middlewares.py:

    import base64
    import random

    from reviews.settings import PROXIES  # import from the project package, not a bare "settings" module


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random proxy from the pool for every request.
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # Only send credentials when the proxy actually has them; an empty
            # 'user_pass' string would otherwise produce a bogus auth header.
            if proxy['user_pass']:
                encoded_user_pass = base64.b64encode(proxy['user_pass'].encode()).decode()
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
    

    Another question: if I have a website that is HTTPS, should I have a proxy pool list for HTTPS only, and then another middleware class HTTPSProxyMiddleware(object) that receives a list HTTPS_PROXIES?

    And my rotate_useragent.py:

    import random
    from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
    
    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            self.user_agent = user_agent
    
        def process_request(self, request, spider):
            ua = random.choice(self.user_agent_list)
            if ua:
                request.headers.setdefault('User-Agent', ua)
    
        # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
        # For more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
        user_agent_list = [\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",\
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
           ]
    

    One more question, and the last one (sorry if it is again a stupid one): in settings.py there is a commented default part, # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'reviews (+http://www.yourdomain.com)'. Should I uncomment it and put my personal information, or just leave it like that? I want to crawl efficiently while following good policies and habits to avoid possible ban issues...

    I am asking all of this because, with these settings, my spiders started to throw errors like:

    twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting http://www.example.com/browse/?start=884 took longer than 180.0 seconds.
    

    and

    Error downloading <GET http://www.example.com/article/2883892/x-review.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
    

    and

    Error downloading <GET http://www.example.com/browse/?start=6747>: TCP connection timed out: 110: Connection timed out.
    

    Thanks so much for your help and time.