Scraping in Python - Preventing IP ban
Solution 1
If you switch to the Scrapy web-scraping framework, you can reuse a number of features built to prevent and handle banning:
- the built-in AutoThrottle extension: automatically throttles crawling speed based on the load of both the Scrapy server and the website you are crawling.
- rotating user agents with the scrapy-fake-useragent middleware: uses a random User-Agent provided by fake-useragent on every request.
- rotating IP addresses, e.g. via a rotating-proxy middleware.
- you can also run it through a local proxy and Tor.
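The options above map onto a handful of documented Scrapy settings. A minimal sketch of a project's settings.py, assuming scrapy-fake-useragent is installed (the middleware path and priority follow that package's README; the delay values are illustrative):

```python
# Hypothetical excerpt from a Scrapy project's settings.py.

# AutoThrottle: adjust crawl speed based on server/site load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per site
DOWNLOAD_DELAY = 2.0                   # floor between requests to the same site

# Rotate user agents with scrapy-fake-useragent
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's default User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # pick a random User-Agent (via fake-useragent) for every request
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```

With these settings in place, no spider code changes are needed; Scrapy applies the throttling and user-agent rotation to every request.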
Solution 2
I had this problem too. I used urllib with Tor in Python 3.
- Download and install the Tor Browser.
- Test Tor: open a terminal and type:
curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com
If you see a result, it worked.
- Now test it in Python by running this code:
import socks    # from the PySocks package (pip install pysocks)
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# route all sockets through the local Tor SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())
If you see
Congratulations. This browser is configured to use Tor.
it worked in Python too, which means you are using Tor for web scraping.
Solution 3
You could use proxies.
You can buy several hundred IPs very cheaply and keep using Selenium as you have so far. Furthermore, I suggest varying the browser you use and other user-agent parameters.
You could iterate over your proxies, using a single IP address to load only a fixed number of pages and stopping before getting banned.
from selenium import webdriver

def load_proxy(PROXY_HOST, PROXY_PORT):
    # build a Firefox profile that routes HTTP traffic through the proxy
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    # override the default user agent (replace with a real User-Agent string)
    fp.set_preference("general.useragent.override", "whatever_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
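The "x pages per IP" idea above can be sketched as a small helper that cycles through a proxy pool and switches proxies after a fixed page count (the class name, pool values, and cap are hypothetical; the returned host/port pair would be fed to a function like load_proxy):

```python
import itertools

class ProxyRotator:
    """Cycle through a pool of proxies, switching to the next one
    after a fixed number of pages per proxy (hypothetical helper)."""

    def __init__(self, proxies, pages_per_proxy=100):
        self._pool = itertools.cycle(proxies)
        self._pages_per_proxy = pages_per_proxy
        self._count = 0
        self._current = next(self._pool)

    def get(self):
        # hand out the current proxy, rotating once the page cap is hit
        if self._count >= self._pages_per_proxy:
            self._current = next(self._pool)
            self._count = 0
        self._count += 1
        return self._current

# example: two placeholder proxies, rotate every 2 pages
rotator = ProxyRotator([("10.0.0.1", 8080), ("10.0.0.2", 8080)], pages_per_proxy=2)
```

Each scraped page would call rotator.get(); once the cap is reached the next proxy in the pool is handed out, and the pool wraps around when exhausted.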
RhymeGuy
Updated on July 09, 2022

Comments
-
RhymeGuy almost 2 years
I am using Python to scrape pages. Until now I didn't have any complicated issues. The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (a new IP, not used before, on a different C block). I have tried spoofing headers and randomizing the time between requests; still the same.
I have tried Selenium and got much better results: about 600-650 pages before getting banned. There I also randomized the requests (between 3-5 seconds, with a time.sleep(300) call on every 300th request). Despite that, I'm getting banned. From this I can conclude that the site bans an IP if it requests more than X pages in one open browser session, or something like that.
Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
-
RhymeGuy over 8 years
I'm not a fan of Scrapy, but I might give it a try, although I'm not sure it will help me. I have used all of the things you recommend and was not able to pass the limit.
-
RhymeGuy over 8 years
Can you recommend a proxy service which I might use?
-
alecxe over 8 years
@RhymeGuy It's just a general answer, so it may help others visiting the topic. In your case, I would say switching IPs via a proxy is the way to go. Thanks.
-
RhymeGuy over 8 years
Thanks, the service looks okay, but it's not so cheap. I'm not even sure the money I'd spend on proxies would be covered by the value of the information I'd gather. I'll have to think it over.
-
Parsa over 8 years
If the pages you are looking for are cached by Google, could you search for them and access the static version cached by the Google crawler?
-
RhymeGuy over 8 years
Unfortunately the site uses a login form and most pages cannot be accessed without logging in, so Google cannot cache them. It seems that using a proxy service is the only reasonable option in this case.
-
Mobin Al Hassan about 4 years
How can we change the IP using the Chrome WebDriver with Selenium and Python?
-
Hanzhou Tang over 3 years
Just want to update: the Tor Browser now listens on port 9150 instead of 9050.
-
Harsh Vardhan over 3 years
Failed to connect to localhost.
-
muinh over 3 years
After all these actions I was still banned by IP address.