Is it possible to slow the Baiduspider crawl frequency?


Solution 1

Great question, and one many webmasters might be interested in since the Baidu spider is notoriously aggressive and can zap resources from servers...

As indicated in Baidu's Web Search news, the Baidu spider does not support the Crawl-delay directive in robots.txt, and instead requires you to register and verify your site with its Baidu Webmaster Tools platform, as stated here on its site. This appears to be the only way to control the crawl frequency directly with Baidu.

The problem is that spam bots also use Baidu's user-agents (listed here under number 2) to spider your site, as indicated in its FAQs here under number 4. So requesting a slower crawl rate from Baidu may not solve everything.

Therefore, if you do decide to use Baidu's Webmaster Tools, it might be wise to also check requests that carry its user-agents against IPs known to be associated with Baidu, using a resource like the Bots vs Browsers Database or a reverse DNS lookup.
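As a rough illustration of the reverse DNS check, here is a minimal Python sketch of a forward-confirmed lookup. It assumes that genuine Baiduspider hosts resolve to names under baidu.com or baidu.jp; that suffix list, the function name, and the injectable `reverse`/`forward` parameters are assumptions made for this example, so verify the domains against Baidu's own documentation before relying on it.

```python
import socket

# Assumption: genuine Baiduspider hosts reverse-resolve into these domains.
ALLOWED_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_genuine_baiduspider(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True if `ip` passes a forward-confirmed reverse DNS check."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:                        # covers socket.herror / gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(ALLOWED_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]   # A lookup: hostname -> list of IPs
    except OSError:
        return False
    # The hostname must map back to the same IP, otherwise the PTR is spoofed.
    return ip in addresses
```

Calling `is_genuine_baiduspider(ip)` on the client address of a request whose user-agent claims to be Baiduspider gives a yes/no answer; since DNS lookups are slow, caching the result per IP would be sensible in practice.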

The only other options are to either block all Baidu user-agents, and thus sacrifice potential traffic from Baidu, or attempt to limit excessive requests using something like mod_qos for Apache, which claims to manage:

  • The maximum number of concurrent requests to a location/resource (URL) or virtual host.
  • Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
  • Limits the number of request events per second (special request conditions).
  • It can also "detect" very important persons (VIP) which may access the web server without or with fewer restrictions.
  • Generic request line and header filter to deny unauthorized operations. Request body data limitation and filtering (requires mod_parp).
  • Limitations on the TCP connection level, e.g., the maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
  • Prefers known IP addresses when server runs out of free TCP connections.
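If mod_qos is not an option, the same idea of capping requests per second can be approximated in application code. The following is a minimal token-bucket sketch in Python, not mod_qos itself and not its configuration syntax; the class name, the `should_serve` helper, and the default rate and burst values are all hypothetical choices for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


_buckets = {}


def should_serve(user_agent, rate=1.0, capacity=5):
    """Return False when a Baiduspider request exceeds the allowed rate."""
    if "baiduspider" not in user_agent.lower():
        return True                        # other clients are not throttled here
    if "baiduspider" not in _buckets:
        _buckets["baiduspider"] = TokenBucket(rate, capacity)
    return _buckets["baiduspider"].allow()
```

A request handler could call `should_serve(request_user_agent)` and answer with an HTTP 429 or 503 status when it returns False. Because the user-agent can be spoofed, combining this with an IP check is advisable.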

I haven't found reported experiences with Baidu Webmaster Tools, which is slow to load and has translation issues (there is no English version either). Such reports might be helpful, though opinion-based of course.

Solution 2

After a lot of research and experimentation with this, I finally bit the bullet and set up a Baidu Webmaster Tools account. It's quite straightforward to use when armed with Google Translate in another window. You may need to have Firebug activated in order to copy and paste Chinese text from buttons that you cannot select in the normal browser mode.

After you have set it up, you need to wait a few days for crawling data to appear, and then you can customize the crawl rate. It appears in a section called "Pressure", which you should be able to reach with this URL:
http://zhanzhang.baidu.com/pressure/adjust?site=http%3A%2F%2Fwww.yourURL.com%2F
(Note that you will only be able to use this URL if you have set up a Baidu Webmaster Tools account and associated the website in question with that account.) Here you will see a slider with your current crawl rate in the center (in my case, 12676 requests per day). Slide it to the left in order to reduce the crawl rate.
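The `site` parameter in that URL is simply the site address percent-encoded, including the scheme and trailing slash. A small Python sketch for building it follows; the helper name `pressure_url` is made up for this example, and the endpoint is copied from the URL above, which may have moved since.

```python
from urllib.parse import quote

# Endpoint taken from the URL quoted in this answer; it may have changed.
PRESSURE_ENDPOINT = "http://zhanzhang.baidu.com/pressure/adjust"


def pressure_url(site):
    """Build the crawl-pressure page URL for `site` (e.g. 'http://example.com/')."""
    # safe="" forces ':' and '/' to be encoded as %3A and %2F.
    return f"{PRESSURE_ENDPOINT}?site={quote(site, safe='')}"


print(pressure_url("http://www.yourURL.com/"))
# prints http://zhanzhang.baidu.com/pressure/adjust?site=http%3A%2F%2Fwww.yourURL.com%2F
```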

I have no idea yet if it actually respects your request. It gives you a warning which says something like this: "We recommend that you use the default Baidu crawl rate for your site. Only use this tool to adjust it if your website has problems with our crawling. To maintain normal crawling of your site, Baidu will weigh your crawl rate adjustment against actual site conditions and therefore cannot guarantee to adjust according to your request."

Author: samthebrand

Product @ Papa. Formerly a Stack Overflow employee.

Updated on September 18, 2022

Comments

  • samthebrand
    samthebrand over 1 year

    Much has been made of the Baidu spider crawl frequency. It's true: "Baiduspider crawls like crazy."

    I've experienced this phenomenon at sites I work with. In at least one instance, I've found that Baiduspider crawls at about the same frequency as Googlebot, despite the fact that Baidu delivers about 0.1% as much traffic as Google.

    I'd like to keep those visits on my site, as few as they are (maybe one day they'll grow?), but I can't justify allowing such a heavy load on my server.

    The accepted answer to the question linked above suggests Baidu Webmaster Tools offers the opportunity to limit crawl rate, but I'm hesitant to open up that (Chinese-only) can of worms.

    Does anybody have any experience limiting Baiduspider crawl rate with BWT? Is there another way to limit this load?

  • samthebrand
    samthebrand almost 11 years
    Baiduspider doesn't support Crawl-Delay. See here.
  • Duarte Patrício
    Duarte Patrício almost 11 years
    Whoops, I had seen it in a few sites' robots.txt files, so I assumed it did! How does that saying go?!
  • samthebrand
    samthebrand almost 11 years
    This is really helpful @Dan. Trying out a few of these solutions (Baidu Webmaster Tools is a real pain.) Will report back.
  • dan
    dan almost 11 years
    Thanks! Great - I'll update this if I find any other options too. This question reflects a lot of webmasters' frustrations with aggressive bots, and concerns with interacting with them (e.g., Baidu Webmaster Tools). Hopefully legitimate bots will take this into consideration, and better tools/options will become available.
  • lazysoundsystem
    lazysoundsystem almost 7 years
    I'm sure I'm not the only one who'd appreciate an update on this - does it respect the request? Would you advise creating an account?
  • lazysoundsystem
    lazysoundsystem almost 7 years
    @samthebrand and dan - please do report back! Have you found any other solutions you can recommend?
  • odony
    odony almost 7 years
    Just updated the direct URL to the crawl frequency adjustment page, as it has been more deeply buried in the Webmaster Tools now (not in the menu anymore). Google translate makes it very hard to find due to confusing translations ;-)