Scraping in Python - Preventing IP ban


Solution 1

If you switch to the Scrapy web-scraping framework, you can reuse a number of features that were built to prevent and handle banning:

  • AutoThrottle: an extension that automatically throttles crawling speed based on the load of both the Scrapy server and the website you are crawling.

  • fake-useragent: use a random User-Agent provided by fake-useragent on every request.
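Both features above are usually wired up in a project's settings.py. A minimal sketch, assuming the scrapy-fake-useragent package provides the User-Agent middleware (the specific delay values are illustrative, not recommendations):

```python
# Hypothetical settings.py fragment for a Scrapy project.
# AutoThrottle ships with Scrapy; the middleware path below assumes
# the scrapy-fake-useragent package is installed.

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in static User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # pick a random real-world User-Agent for every request
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```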

Solution 2

I had this problem too. I used urllib with Tor in Python 3.

  1. Download and install the Tor Browser.
  2. Test Tor: open a terminal and type:

curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

If you see a result, it worked.

  3. Now test it in Python by running this code:
import socks    # provided by the PySocks package
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Route all sockets through the Tor SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())

If you see

Congratulations. This browser is configured to use Tor.

then it worked in Python too, which means you are using Tor for your web scraping.
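If the target site bans a single Tor exit IP, you can ask Tor for a fresh circuit (and usually a new exit IP) between batches of requests. A minimal stdlib sketch of the ControlPort handshake, assuming ControlPort 9051 and a password are configured in torrc (the function name and error handling are illustrative):

```python
import socket

def renew_tor_identity(host="127.0.0.1", port=9051, password=""):
    """Ask a running Tor daemon for a new circuit via its ControlPort.

    Hypothetical sketch: assumes ControlPort 9051 is enabled in torrc.
    The control protocol answers successful commands with a "250" line.
    """
    with socket.create_connection((host, port)) as s:
        s.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor ControlPort authentication failed")
        s.sendall(b"SIGNAL NEWNYM\r\n")
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("SIGNAL NEWNYM was rejected")
```

Note that Tor rate-limits NEWNYM signals, so requesting a new identity on every single page would not help; renewing every few hundred pages fits the ban pattern described in the question.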

Solution 3

You could use proxies.

You can buy several hundred IPs very cheaply and keep using Selenium as you have been. I also suggest varying the browser you use and other user-agent parameters.

You could use each IP address to load only a fixed number of pages, stopping before you get banned.

from selenium import webdriver

def load_proxy(proxy_host, proxy_port):
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", proxy_host)
    fp.set_preference("network.proxy.http_port", int(proxy_port))
    fp.set_preference("general.useragent.override", "your_user_agent_string")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
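The page-budget idea above, using one IP for only x pages before moving on, can be sketched as a simple rotation over a proxy pool. This is a hypothetical helper (the names and the pages_per_proxy limit are illustrative); each yielded pair could be fed to a loader like load_proxy above:

```python
from itertools import cycle

def rotate_proxies(urls, proxies, pages_per_proxy=100):
    """Pair each URL with a proxy, switching to the next proxy
    after pages_per_proxy pages so no single IP exceeds the limit.

    Hypothetical sketch: the limit should stay below the observed
    ban threshold for the target site.
    """
    pool = cycle(proxies)
    proxy, used = next(pool), 0
    for url in urls:
        if used >= pages_per_proxy:
            proxy, used = next(pool), 0
        used += 1
        yield url, proxy
```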
Author: RhymeGuy

Updated on July 09, 2022

Comments

  • RhymeGuy
    RhymeGuy almost 2 years

    I am using Python to scrape pages. Until now I haven't had any complicated issues.

    The site I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.

    Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (new IP, not used before, different C block). I have tried spoofing headers and randomizing the time between requests, with the same result.

    I have tried Selenium and got much better results: about 600-650 pages before getting banned. Here I also randomized requests (a delay of 3-5 seconds between requests, plus a time.sleep(300) call on every 300th request). Despite that, I'm getting banned.

    From this I conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.

    Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
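The pacing scheme described in the question can be sketched as a delay plan computed up front (a stdlib sketch; the function name, parameters, and defaults are illustrative, and the actual scraper would sleep for each value before its request):

```python
import random

def plan_delays(n_requests, short=(3.0, 5.0), long_pause=300.0, every=300, seed=None):
    """Build the list of inter-request delays the question describes:
    a random 3-5 second wait between requests, plus an extra 300 second
    pause after every 300th request.
    """
    rng = random.Random(seed)
    delays = []
    for i in range(1, n_requests + 1):
        d = rng.uniform(*short)
        if i % every == 0:
            d += long_pause   # the long sleep on every 300th request
        delays.append(d)
    return delays
```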

  • RhymeGuy
    RhymeGuy over 8 years
    I'm not a fan of Scrapy, but I might give it a try, although I'm not sure it will help me. I have used all of the things you recommend and was not able to pass the limit.
  • RhymeGuy
    RhymeGuy over 8 years
    Can you recommend a proxy service I might use?
  • alecxe
    alecxe over 8 years
    @RhymeGuy it's just a general answer so that it may help others visiting the topic. In your case, I would say switching IPs via a proxy is the way to go. Thanks.
  • RhymeGuy
    RhymeGuy over 8 years
    Thanks. The service looks okay, but it's not that cheap. I'm not even sure the value of the information I'd gather would cover what I'd pay for proxies. I'll have to think it over.
  • Parsa
    Parsa over 8 years
    If the pages you are searching for are cached by Google, you could search for them on Google and access the static version cached by the Google crawler.
  • RhymeGuy
    RhymeGuy over 8 years
    Unfortunately the site uses a login form and most of the pages cannot be accessed without logging in, so Google cannot cache them. It seems that using a proxy service is the only reasonable option in this case.
  • Mobin Al Hassan
    Mobin Al Hassan about 4 years
    How can we change the IP using the Chrome WebDriver with Selenium and Python?
  • Hanzhou Tang
    Hanzhou Tang over 3 years
    Just want to update: the Tor Browser now listens on port 9150 instead of 9050.
  • Harsh Vardhan
    Harsh Vardhan over 3 years
    failed to connect to localhost.
  • muinh
    muinh over 3 years
    After all these actions I was still banned by IP address.