Scraping in Python - Preventing IP ban


Solution 1

If you switch to the Scrapy web-scraping framework, you can reuse a number of features that were built to prevent and handle banning:

  • AutoThrottle: an extension that automatically throttles crawling speed based on the load of both the Scrapy server and the website you are crawling.

  • fake-useragent: use a random User-Agent provided by fake-useragent on every request.
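Both features above are usually wired up in a project's settings.py. A minimal sketch, assuming the scrapy-fake-useragent package provides the User-Agent middleware (the specific delay values are illustrative, not recommendations):

```python
# Hypothetical settings.py fragment for a Scrapy project.
# AutoThrottle ships with Scrapy; the middleware path below assumes
# the scrapy-fake-useragent package is installed.

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in static User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # pick a random real-world User-Agent for every request
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```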

Solution 2

I had this problem too. I used urllib with Tor in Python 3.

  1. Download and install the Tor Browser.
  2. Test Tor: open a terminal and type:

curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

If you see a result, it worked.

  3. Now test it in Python by running this code:
import socks    # provided by the PySocks package
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Route all sockets through the Tor SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())

If you see

Congratulations. This browser is configured to use Tor.

then it worked in Python too, which means you are using Tor for your web scraping.
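If the target site bans a single Tor exit IP, you can ask Tor for a fresh circuit (and usually a new exit IP) between batches of requests. A minimal stdlib sketch of the ControlPort handshake, assuming ControlPort 9051 and a password are configured in torrc (the function name and error handling are illustrative):

```python
import socket

def renew_tor_identity(host="127.0.0.1", port=9051, password=""):
    """Ask a running Tor daemon for a new circuit via its ControlPort.

    Hypothetical sketch: assumes ControlPort 9051 is enabled in torrc.
    The control protocol answers successful commands with a "250" line.
    """
    with socket.create_connection((host, port)) as s:
        s.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor ControlPort authentication failed")
        s.sendall(b"SIGNAL NEWNYM\r\n")
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("SIGNAL NEWNYM was rejected")
```

Note that Tor rate-limits NEWNYM signals, so requesting a new identity on every single page would not help; renewing every few hundred pages fits the ban pattern described in the question.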

Solution 3

You could use proxies.

You can buy several hundred IPs very cheaply and keep using Selenium as you have been. I also suggest varying the browser you use and other user-agent parameters.

You could use each IP address to load only a fixed number of pages, stopping before you get banned.

from selenium import webdriver

def load_proxy(proxy_host, proxy_port):
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", proxy_host)
    fp.set_preference("network.proxy.http_port", int(proxy_port))
    fp.set_preference("general.useragent.override", "your_user_agent_string")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
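The page-budget idea above, using one IP for only x pages before moving on, can be sketched as a simple rotation over a proxy pool. This is a hypothetical helper (the names and the pages_per_proxy limit are illustrative); each yielded pair could be fed to a loader like load_proxy above:

```python
from itertools import cycle

def rotate_proxies(urls, proxies, pages_per_proxy=100):
    """Pair each URL with a proxy, switching to the next proxy
    after pages_per_proxy pages so no single IP exceeds the limit.

    Hypothetical sketch: the limit should stay below the observed
    ban threshold for the target site.
    """
    pool = cycle(proxies)
    proxy, used = next(pool), 0
    for url in urls:
        if used >= pages_per_proxy:
            proxy, used = next(pool), 0
        used += 1
        yield url, proxy
```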
Author: RhymeGuy

Updated on July 09, 2022

Comments

  • RhymeGuy
    RhymeGuy almost 2 years

    I am using Python to scrape pages. Until now I haven't had any complicated issues.

    The site I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.

    Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (new IP, not used before, different C block). I have tried spoofing headers and randomizing the time between requests, with the same result.

    I have tried Selenium and got much better results: about 600-650 pages before getting banned. Here I also randomized requests (a delay of 3-5 seconds between requests, plus a time.sleep(300) call on every 300th request). Despite that, I'm getting banned.

    From this I conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.

    Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
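The pacing scheme described in the question can be sketched as a delay plan computed up front (a stdlib sketch; the function name, parameters, and defaults are illustrative, and the actual scraper would sleep for each value before its request):

```python
import random

def plan_delays(n_requests, short=(3.0, 5.0), long_pause=300.0, every=300, seed=None):
    """Build the list of inter-request delays the question describes:
    a random 3-5 second wait between requests, plus an extra 300 second
    pause after every 300th request.
    """
    rng = random.Random(seed)
    delays = []
    for i in range(1, n_requests + 1):
        d = rng.uniform(*short)
        if i % every == 0:
            d += long_pause   # the long sleep on every 300th request
        delays.append(d)
    return delays
```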

  • RhymeGuy
    RhymeGuy over 8 years
    I'm not a fan of Scrapy, but I might give it a try, although I'm not sure it will help me. I have used all of the things you recommend and was not able to pass the limit.
  • RhymeGuy
    RhymeGuy over 8 years
    Can you recommend a proxy service I might use?
  • alecxe
    alecxe over 8 years
    @RhymeGuy it's just a general answer so that it may help others visiting the topic. In your case, I would say switching IPs via a proxy is the way to go. Thanks.
  • RhymeGuy
    RhymeGuy over 8 years
    Thanks. The service looks okay, but it's not that cheap. I'm not even sure the value of the information I'd gather would cover what I'd pay for proxies. I'll have to think it over.
  • Parsa
    Parsa over 8 years
    If the pages you are searching for are cached by Google, you could search for them on Google and access the static version cached by the Google crawler.
  • RhymeGuy
    RhymeGuy over 8 years
    Unfortunately the site uses a login form and most of the pages cannot be accessed without logging in, so Google cannot cache them. It seems that using a proxy service is the only reasonable option in this case.
  • Mobin Al Hassan
    Mobin Al Hassan about 4 years
    How can we change the IP using the Chrome WebDriver with Selenium and Python?
  • Hanzhou Tang
    Hanzhou Tang over 3 years
    Just want to update: the Tor Browser now listens on port 9150 instead of 9050.
  • Harsh Vardhan
    Harsh Vardhan over 3 years
    failed to connect to localhost.
  • muinh
    muinh over 3 years
    After all these actions I was still banned by IP address.