Proxy IP for Scrapy framework

Solution 1

Here are the options I'm currently using (depending on my needs):

  • proxymesh.com - reasonable prices for smaller projects. I have never had any issues with the service, as it works out of the box with Scrapy (I'm not affiliated with them).
  • A self-built script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create a SOCKS proxy connection on each one. Those connections are piped through delegated to create normal HTTP proxies that are usable with Scrapy. The HTTP proxies can either be load-balanced with something like HAProxy, or you can build a custom middleware that rotates proxies.
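
For the custom-middleware route, a minimal sketch of a round-robin downloader middleware might look like the following. It assumes the HTTP proxies are already running and are listed in a setting called `PROXY_LIST`; that setting name and the class name are made up for illustration and are not part of Scrapy. What it relies on from Scrapy itself is the `process_request` hook and the `proxy` key in `request.meta`.

```python
import itertools


class RotatingProxyMiddleware:
    """Assign a proxy to each outgoing request in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() repeats the list endlessly: p1, p2, ..., pN, p1, ...
        self._proxies = itertools.cycle(list(proxies))

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting (an assumption),
        # e.g. ["http://127.0.0.1:8081", "http://127.0.0.1:8082"].
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's downloader reads the proxy to use from request.meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        return None  # let the request continue through the middleware chain
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` in `settings.py` (for example under a path such as `myproject.middlewares.RotatingProxyMiddleware`, which is again only a placeholder), with a priority lower than that of the built-in `HttpProxyMiddleware` so that it runs first. A production version would probably also drop proxies that keep failing; this sketch does not.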

The latter solution is what currently works best for me; it pushes around 20-30 GB of traffic per day without any problems.
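
The SOCKS tunnel step of the EC2 setup could be scripted roughly as follows. This is only a sketch of one part of that setup, not the author's actual script: it builds one `ssh -N -D` command per instance (`-D` opens a local SOCKS proxy, `-N` skips running a remote command). The host addresses, the `ec2-user` login, and the starting port are placeholder assumptions, and the later step of turning each SOCKS proxy into an HTTP proxy is not covered.

```python
import shlex


def socks_tunnel_command(host, local_port, user="ec2-user", key_file=None):
    """Build an ssh command that opens a local SOCKS proxy through `host`."""
    cmd = ["ssh", "-N", "-D", f"127.0.0.1:{local_port}"]
    if key_file:
        cmd += ["-i", key_file]  # private key used to log in to the instance
    cmd.append(f"{user}@{host}")
    return cmd


def plan_tunnels(hosts, first_port=1080, **ssh_options):
    """Map one local port to one ssh command for each host."""
    return {
        first_port + offset: socks_tunnel_command(host, first_port + offset, **ssh_options)
        for offset, host in enumerate(hosts)
    }


if __name__ == "__main__":
    # Placeholder addresses from the documentation range; replace with real instances.
    for port, cmd in plan_tunnels(["203.0.113.10", "203.0.113.11"]).items():
        print(port, shlex.join(cmd))
```

Each command list can be passed to `subprocess.Popen` to start the tunnel in the background; the script above only prints the commands so they can be inspected first.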

Solution 2

Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned, and it is used to crawl very large and high-profile websites.

Disclaimer: I work for the parent company, Scrapinghub, which is also a core developer of Scrapy.

Author: Binit Singh

Updated on September 15, 2022

Comments

  • Binit Singh
    Binit Singh over 1 year

    I am developing a web crawling project using Python and the Scrapy framework. It crawls approximately 10k web pages from e-commerce shopping websites. The whole project is working fine, but before moving the code from the testing server to the production server I want to choose a better proxy IP provider, so that I don't have to worry about my IP being blocked or websites denying access to my spiders.

    Until now I have been using a middleware in Scrapy to manually rotate IPs from free proxy lists available on various websites like this

    Now I am confused about which option I should choose:

    1. Buy a premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/

    2. Use Tor

    3. Use a VPN service like http://www.hotspotshield.com/

    4. Any option better than the above three

  • Spaceman
    Spaceman almost 10 years
    Does Amazon allow changing public IPs often? I didn't find any info on that. I'd like to spin up 20 instances and rotate their public IPs often (probably every minute) using the APIs.
  • Ming
    Ming over 7 years
    @herrherr Could you share more on how to implement your second option? Are there any guides we could look up? Much appreciated, thanks :)
  • demisx
    demisx about 4 years
    It’s just too expensive for a single developer. Their plans start at $99/month.
  • Nabin
    Nabin almost 4 years
    If you don't want to keep checking for available free proxies yourself, you can use this library: github.com/nabinkhadka/scrapy-rotating-free-proxies. While a spider is running, the library automatically fetches freshly available proxies.