How to set Robots.txt or Apache to allow crawlers only at certain hours?


Solution 1

You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).

Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.

If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.

Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a Retry-After HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (it can block many other features, such as Sitemaps, contextual ads, or website verification, and can slow down crawling in general), in exceptional cases it might make sense. In the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
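
Since the question title mentions Apache, here is a minimal sketch of what a time-based 503 could look like with mod_rewrite. The hour window, the crawler user-agent list, and the Retry-After value below are illustrative assumptions, not something recommended by this answer:

# Illustrative only: answer 503 to selected crawlers between 08:00 and 17:59 server time.
# Assumes Apache 2.4 with mod_rewrite enabled (and mod_headers for Retry-After).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Slurp) [NC]
# %{TIME_HOUR} is zero-padded, so lexicographic > and < comparisons work here.
RewriteCond %{TIME_HOUR} >07
RewriteCond %{TIME_HOUR} <18
RewriteRule .* - [R=503,L]
# Attach a Retry-After (in seconds) to 503 responses only.
Header always set Retry-After "7200" "expr=%{REQUEST_STATUS} == 503"

It is worth verifying with curl that your setup actually returns the 503 and the header before relying on it, since module availability and config context vary between installations.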

Solution 2

You cannot determine what time of day crawlers do their work; however, with Crawl-delay you may be able to reduce the frequency with which they request pages. This can be useful to prevent them from requesting pages in rapid succession.

For example:

User-agent: *
Crawl-delay: 5

Solution 3

This is not possible using any robots.txt syntax - the feature simply isn't there.

You might be able to influence crawlers by actually altering the robots.txt file depending on the time of day. I expect Google checks the file shortly before crawling, for example. But obviously there is a huge danger of scaring crawlers away for good that way, and that risk is probably more problematic than whatever load you are getting right now.
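
If you did want to experiment with that anyway, a hedged sketch would be to serve a different file at the /robots.txt URL depending on the hour, again with Apache mod_rewrite. The file names robots-peak.txt and robots-offpeak.txt and the hour window are hypothetical, and the caching caveat from Solution 1 still applies:

# Illustrative only: serve a stricter robots file during an assumed 08:00-17:59 peak window.
# Written for vhost/server config context; adjust the pattern (drop the leading /) for .htaccess.
RewriteEngine On
RewriteCond %{TIME_HOUR} >07
RewriteCond %{TIME_HOUR} <18
RewriteRule ^/robots\.txt$ /robots-peak.txt [L]
RewriteRule ^/robots\.txt$ /robots-offpeak.txt [L]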


Comments

  • Joel Box
    Joel Box almost 2 years

    As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during off-peak hours.

    Is there a method to achieve this?

    edit: thanks for all the good advice.

    This is another solution we found.

    2bits.com has an article on configuring an iptables firewall to limit the number of connections from certain IP addresses (the article).

    The iptables setting:

    • Using connlimit

    In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

    iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

    This limits each IP address to no more than 5 simultaneous connections. This sort of "rations" connections, and prevents a single crawler (IP) from hitting the site with a flood of parallel requests.

  • Pekka
    Pekka over 13 years
  • Joel Box
    Joel Box over 13 years
    It is not so much about scheduling the spider; it's more about allowing it access at a given time or not. The spider will always come back.
  • Joel Box
    Joel Box over 13 years
    Thanks, I was aware of that one. Unfortunately there is no directive to regulate the number of crawlers. Last time round, we had 12 crawlers hitting the site at the same time.
  • John Mueller
    John Mueller over 13 years
    FWIW Google does not support crawl-delay -- there are just too many bogus values specified there that do not make sense. If you want to adjust the crawl rate for Googlebot, you can do that in Google Webmaster Tools.
  • Joel Box
    Joel Box about 13 years
    The 503 might indeed be the way forward for dynamic sites.
  • dave
    dave over 5 years
    429 (Too Many Requests) would be the correct response code, ideally returned with a Retry-After header -- despite the fact that Google (unforgivably) treats it the same as a 503, we might hope that better search engines are more standards-compliant.
  • dave
    dave over 5 years
    Frustratingly, Google no longer respects this standards-based directive: Googlebot completely ignores any crawl-delay set in robots.txt.