Baiduspider is crawling my site even when forbidden by robots.txt, how do I prevent it?


Solution 1

You can try blocking specific IP addresses in your .htaccess file. You can find the ranges here.
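As a rough sketch of the IP-blocking approach, assuming Apache 2.4 (where `Require` takes the place of the older `Order`/`Deny` directives) and an .htaccess file that is allowed to override authorization settings, the block could look something like this. The two ranges shown are documentation placeholders, not Baidu's real ranges; substitute the ones from the list mentioned above.

```apache
# Assumption: Apache 2.4 with mod_authz_core, and AllowOverride AuthConfig (or All)
# enabled for this directory so that .htaccess may contain Require directives.
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except these ranges. PLACEHOLDERS ONLY (RFC 5737 documentation blocks);
    # replace them with the actual ranges you want to block.
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
</RequireAll>
```

Blocking by IP only helps for as long as the crawler keeps using those ranges, so it is probably best combined with the robots.txt and User-Agent rules.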

In robots.txt you can also add the following

User-agent: Baiduspider
User-agent: baiduspider
User-agent: Baiduspider+ 
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

Also, if you use caching plugins or CDN, make sure to clear all your cache.

Solution 2

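I think the problem with your rewrite rule is the OR flag. That flag tells Apache that another RewriteCond follows and that either one may match, but you only have one condition. With a trailing OR on the last condition, the rule can end up applying to every request, which would explain why you were locked out of your own site.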
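For comparison, here is a minimal sketch of how the OR flag is normally used: it chains two conditions, and the last condition before the rule carries no OR. `ExampleBot` is a made-up second user agent, included only to show the chaining.

```apache
RewriteEngine On
# First condition: OR means "this one, or the next one"
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
# Last condition: no OR flag. "ExampleBot" is hypothetical.
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
# Return 403 Forbidden when either condition matched
RewriteRule ^ - [F]
```

If you only want to block one bot, drop the second condition and the OR flag together.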

Here is a site that provides a similar rule for blocking BaiduSpider with slightly different syntax:

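RewriteEngine On
# Unanchored, case-insensitive match: Baiduspider's User-Agent normally starts with
# "Mozilla/5.0 (compatible; Baiduspider/...", so a pattern anchored with ^ would miss it.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
# Answer matching requests with 403 Forbidden
RewriteRule .* - [F]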
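
If you would rather do this in httpd.conf on Apache 2.4, one possible equivalent uses mod_setenvif together with the `Require` directive instead of the old `Order allow,deny` syntax. This is only a sketch: the variable name `block_baidu` is arbitrary, and `/var/www/html` is assumed to be your document root (the CentOS default), so adjust both to your setup.

```apache
# Tag any request whose User-Agent contains "Baiduspider" (case-insensitive).
# "block_baidu" is an arbitrary variable name chosen for this example.
<IfModule setenvif_module>
    SetEnvIfNoCase User-Agent "Baiduspider" block_baidu
</IfModule>

# Assumption: the site is served from /var/www/html.
<Directory "/var/www/html">
    <RequireAll>
        Require all granted
        # Refuse requests that were tagged above
        Require not env block_baidu
    </RequireAll>
</Directory>
```

Restart or reload Apache after editing httpd.conf so the change takes effect.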

Author: yuli chika

Updated on September 18, 2022

Comments

  • yuli chika
    yuli chika over 1 year

    My site has heavy traffic because of a bot. I checked access_log and found that Baiduspider hits my site 10-20 times per minute. I do not need Chinese traffic. I searched and read http://www.baidu.com/search/robots_english.html

    I added a rule to robots.txt and then restarted Apache, but it doesn't work. Baiduspider still crawls my site.

    User-agent: Baiduspider
    Disallow: /
    
    User-agent: *
    Disallow: /feed/
    Disallow: /trackback/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /xmlrpc.php
    

    I found their feedback page, http://zhanzhang.baidu.com/feedback/index. I can translate the page into my language, but I cannot translate or enter the captcha.

    Then I searched and found this article: http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html But when I add its rule to .htaccess, I cannot access my site ("you do not have permission to access this site"). Did I insert it in the wrong position? I need help.

    # BEGIN WordPress
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    
    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC,OR]
    RewriteRule ^.* - [F,L]
    
    #some custom rewrite rule
    RewriteRule ^article/([^/\.]+)/?$ /article/$1.php [L,QSA]
    
    RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
    </IfModule>
    

    BTW, my server is CentOS 7 with Apache 2.4.6. I also tried editing httpd.conf, but I could not find any article for Apache 2.4.6 that uses <IfModule setenvif_module>; all the articles use <IfModule mod_setenvif.c>. Apache 2.4 also replaced the Order allow,deny rules, so I have no idea how to adapt those examples for my httpd.conf.

    Anyway, I just want to block Baiduspider. Thanks.

    • closetnoc
      closetnoc over 9 years
      Baidu is often fairly well behaved. It is possible that, since Baidu is a Japanese/Chinese search engine operating mostly from China, some scrapers are using its agent name and going rogue. This may be what you are seeing. Otherwise, this is something I need to look into further.