Scrapy using pool of random proxies to avoid being banned

Solution 1

There is already a library for this: https://github.com/aivarsk/scrapy-proxies

Please download it from there. It is not on pypi.org yet, so you can't install it easily using pip or easy_install.
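For reference, a minimal settings sketch along the lines of that project's README; the setting names (PROXY_LIST, RETRY_HTTP_CODES) and the middleware path scrapy_proxies.RandomProxy are taken from the README at the time of writing, so verify them against the repository you download:

    # settings.py -- sketch of a scrapy-proxies setup; verify names against the repo.

    # Retry aggressively, since free proxies fail often.
    RETRY_TIMES = 10
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }

    # Plain-text file with one proxy per line, e.g. http://host:port
    PROXY_LIST = '/path/to/proxy/list.txt'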

Solution 2

  1. There is no single correct answer for this. Some proxies are not always available, so you have to check them now and then. Also, if you use the same proxy every time, the server you are scraping may block its IP as well, but that depends on the security mechanisms that server has.
  2. Yes, because you don't know whether all the proxies in your pool support HTTPS. Or you could keep just one pool and add a field to each proxy that indicates its HTTPS support (see the sketch after this list).
  3. In your settings you are disabling the user agent middleware: 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None. With it disabled, the USER_AGENT setting won't have any effect.
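
To illustrate points 1 and 2, here is a rough sketch (not part of the original answer) of a single pool whose entries carry an https flag: the middleware only hands HTTPS requests to proxies marked as supporting HTTPS, and drops a proxy from the pool once it causes a download error. The entry format, the 'https' field and the class name are assumptions for illustration only.

    # middlewares.py -- illustrative sketch; the 'https' field and the
    # removal-on-failure policy are assumptions, not a definitive implementation.
    import base64
    import random

    PROXIES = [
        {'ip_port': '168.63.249.35:80', 'user_pass': '', 'https': True},
        {'ip_port': '162.17.98.242:8888', 'user_pass': '', 'https': False},
    ]

    class SchemeAwareProxyMiddleware(object):
        def process_request(self, request, spider):
            # Only consider proxies that can handle the request's scheme.
            wants_https = request.url.startswith('https://')
            candidates = [p for p in PROXIES if p['https'] or not wants_https]
            if not candidates:
                return  # fall back to a direct request
            proxy = random.choice(candidates)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                creds = base64.b64encode(proxy['user_pass'].encode()).decode()
                request.headers['Proxy-Authorization'] = 'Basic ' + creds

        def process_exception(self, request, exception, spider):
            # Crude "check them now and then": drop a proxy once it fails.
            bad = request.meta.get('proxy', '').replace('http://', '')
            for p in list(PROXIES):
                if p['ip_port'] == bad:
                    PROXIES.remove(p)
                    spider.log("Removed dead proxy %s" % bad)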

Author: Inês Martins

Junior data scientist. Pylady.

Updated on September 18, 2022

Comments

  • Inês Martins, over 1 year ago:

    I am quite new to scrapy (and my background is not informatics). There is a website that I can't visit with my local IP, since I am banned, but I can visit it using a VPN service in the browser. For my spider to be able to crawl it, I set up a pool of proxies that I found here: http://proxylist.hidemyass.com/. With that, my spider is able to crawl and scrape items, but my doubt is whether I have to change the proxy pool list every day. Sorry if my question is a dumb one...

    Here is my settings.py:

    BOT_NAME = 'reviews'
    
    SPIDER_MODULES = ['reviews.spiders']
    NEWSPIDER_MODULE = 'reviews.spiders'
    DOWNLOAD_DELAY = 1
    RANDOMIZE_DOWNLOAD_DELAY = True
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,  # disabled to avoid "IOError: Not a gzipped file" exceptions
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'reviews.rotate_useragent.RotateUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
        'reviews.middlewares.ProxyMiddleware': 100,
    }
    
    PROXIES = [{'ip_port': '168.63.249.35:80', 'user_pass': ''},
               {'ip_port': '162.17.98.242:8888', 'user_pass': ''},
               {'ip_port': '70.168.108.216:80', 'user_pass': ''},
               {'ip_port': '45.64.136.154:8080', 'user_pass': ''},
               {'ip_port': '149.5.36.153:8080', 'user_pass': ''},
               {'ip_port': '185.12.7.74:8080', 'user_pass': ''},
               {'ip_port': '150.129.130.180:8080', 'user_pass': ''},
               {'ip_port': '185.22.9.145:8080', 'user_pass': ''},
               {'ip_port': '200.20.168.135:80', 'user_pass': ''},
               {'ip_port': '177.55.64.38:8080', 'user_pass': ''},]
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'reviews (+http://www.yourdomain.com)'
    

    Here is my middlewares.py:

    import base64
    import random

    from reviews.settings import PROXIES  # import from the project package, not a bare "settings" module


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random proxy from the pool for every request.
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # Only send credentials when the proxy actually has them; an empty
            # 'user_pass' string would otherwise produce a bogus auth header.
            if proxy['user_pass']:
                encoded_user_pass = base64.b64encode(proxy['user_pass'].encode()).decode()
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
    

    Another question: if I have a website that is HTTPS, should I have a proxy pool list for HTTPS only, and then another middleware class HTTPSProxyMiddleware(object) that receives a list HTTPS_PROXIES?

    And my rotate_useragent.py:

    import random
    from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
    
    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            self.user_agent = user_agent
    
        def process_request(self, request, spider):
            ua = random.choice(self.user_agent_list)
            if ua:
                request.headers.setdefault('User-Agent', ua)
    
        # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
        # For more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
        user_agent_list = [\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",\
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
           ]
    

    One more question, and the last one (sorry if it is again a stupid one): in settings.py there is a commented default part, # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'reviews (+http://www.yourdomain.com)'. Should I uncomment it and put my personal information, or just leave it like that? I want to crawl efficiently while following good policies and habits to avoid possible ban issues...

    I am asking all of this because, with these settings, my spiders started to throw errors like:

    twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting http://www.example.com/browse/?start=884 took longer than 180.0 seconds.
    

    and

    Error downloading <GET http://www.example.com/article/2883892/x-review.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
    

    and

    Error downloading <GET http://www.example.com/browse/?start=6747>: TCP connection timed out: 110: Connection timed out.
    

    Thanks so much for your help and time.