Baiduspider is crawling my site even when forbidden by robots.txt, how do I prevent it?
Solution 1
You can try blocking specific IP addresses in your .htaccess file. You can find the ranges here.
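As a rough sketch of what such an IP block could look like on Apache 2.4 (the version mentioned in the question), the fragment below uses mod_authz_core's Require directives. The two ranges are documentation placeholders (RFC 5737), not real Baidu ranges; substitute the ranges you actually want to block. It also assumes that AllowOverride permits AuthConfig in .htaccess.

```apache
# Placeholder ranges only (RFC 5737 documentation addresses);
# replace them with the crawler ranges you want to refuse.
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except clients connecting from these networks.
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
</RequireAll>
```

A negated Require cannot grant access on its own, which is why it sits inside a RequireAll block next to Require all granted.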
In robots.txt you can also add the following
User-agent: Baiduspider
User-agent: baiduspider
User-agent: Baiduspider+
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
Also, if you use caching plugins or a CDN, make sure to clear their caches so the updated robots.txt is actually served.
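Because robots.txt is only advisory, a server-side block may be more dependable. On Apache 2.4 the older Order allow,deny directives are replaced by Require, so a User-Agent block in httpd.conf could be sketched roughly as follows. The variable name block_baidu is made up for this example, and /var/www/html is assumed to be the document root (the CentOS default); adjust both to your setup.

```apache
<IfModule setenvif_module>
    # Flag any request whose User-Agent contains "Baiduspider" (case-insensitive).
    # block_baidu is a hypothetical variable name chosen for this sketch.
    SetEnvIfNoCase User-Agent "Baiduspider" block_baidu
</IfModule>

# Assumed document root; change it to match your DocumentRoot.
<Directory "/var/www/html">
    <RequireAll>
        Require all granted
        # Refuse requests that were flagged above.
        Require not env block_baidu
    </RequireAll>
</Directory>
```

If the Directory block for your document root already contains a Require all granted line, fold the RequireAll section into that existing block rather than adding a second one, then reload Apache.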
Solution 2
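I think the problem with your rewrite rule is the OR flag. That flag tells mod_rewrite that another rewrite condition follows and that either one may match. You only have one condition, so when it fails there is nothing left to check and the rule appears to apply to every request, which would explain why you were locked out of your own site.
Here is a site that provides a similar rule for blocking Baiduspider with slightly different syntax: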
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]
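One caveat with the anchored pattern: as far as I know, current Baiduspider requests usually send a User-Agent that starts with "Mozilla/5.0 (compatible; Baiduspider/2.0; ...", so ^Baiduspider may never match. A variant without the anchor, offered as a sketch rather than a tested drop-in, would be:

```apache
RewriteEngine On
# No leading ^: match "Baiduspider" anywhere in the User-Agent header.
# No [OR]: this is the only condition attached to the rule below.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
# Answer matching requests with 403 Forbidden; [F] already implies [L].
RewriteRule ^ - [F]
```

Place these lines before the other RewriteRule lines in .htaccess, so the block is evaluated before the redirect and the WordPress rules.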
yuli chika
Updated on September 18, 2022
Comments
-
yuli chika over 1 year
My site is getting heavy traffic from a bot. I checked access_log, and Baiduspider is hitting my site 10-20 times per minute. I do not need Chinese traffic. I searched and read http://www.baidu.com/search/robots_english.html
I added a rule to robots.txt and then restarted Apache, but it doesn't work. Baiduspider still crawls my site.
User-agent: Baiduspider
Disallow: /
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
I found their feedback page, http://zhanzhang.baidu.com/feedback/index. I can translate the page into my language, but I cannot read or enter the captcha.
Then I searched and found this article: http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html But when I add its rule to .htaccess, I cannot access my site ("you do not have permission to access this site"). Did I insert it in the wrong position? I need some help.
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC,OR]
RewriteRule ^.* - [F,L]
#some custom rewrite rule
RewriteRule ^article/([^/\.]+)/?$ /article/$1.php [L,QSA]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
BTW, my server is CentOS 7 with Apache 2.4.6. I also tried httpd.conf, but I could not find any article that covers <IfModule setenvif_module> on Apache 2.4.6; all the articles use <IfModule mod_setenvif.c> ... Apache 2.4.6 dropped the order allow,deny rule, so I have no idea how to adapt those examples and add them to my httpd.conf.
Anyway, I just want to refuse Baiduspider.
Thanks.
-
closetnoc over 9 years
Baidu is often fairly well behaved. Since Baidu is a Japanese/Chinese search engine operating mostly from China, it is possible that some scrapers are using its agent name and going rogue. This may be what you are seeing. Otherwise, this is something I need to look into further.