Serve a different robots.txt file for every site hosted in the same directory


Solution 1

I wouldn't count on all spiders being able to follow a redirect to get to a robots.txt file. See: Does Google respect a redirect header for robots.txt to a different file name?

Assuming you are hosted on an Apache server, you could use mod_rewrite from your .htaccess file to serve the correct file for the correct domain:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.([a-z\.]+)$
RewriteRule ^robots\.txt$ /%1/robots.txt [L]

In that case, the robots.txt file for your .cl domain would be at /cl/robots.txt and the one for your .com.au domain at /com.au/robots.txt (both relative to your document root).
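To sanity-check the capture, here is the same host-to-path mapping expressed in Python, for illustration only; the live rewrite happens entirely inside Apache:

```python
import re

# The same capture logic as the RewriteCond above: %1 is the part of
# the host name after "www.example." (the ccTLD, possibly multi-level).
HOST_RE = re.compile(r"^www\.example\.([a-z.]+)$")

def robots_path(host):
    """Return the internal path a request for /robots.txt is rewritten to."""
    m = HOST_RE.match(host)
    if m is None:
        return None  # host doesn't match; no rewrite happens
    return "/%s/robots.txt" % m.group(1)

print(robots_path("www.example.cl"))      # /cl/robots.txt
print(robots_path("www.example.com.au"))  # /com.au/robots.txt
```

Note that a host without the www. prefix falls through unmatched, which is one reason the pattern in Solution 2 makes the www. part optional.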

Solution 2

While the redirect approach should work, it has a few potential drawbacks:

  • Every crawler has to do two HTTP requests: one to discover the redirect, and another one to actually fetch the file.

  • Some crawlers might not handle the 301 response for robots.txt correctly; there's nothing in the original robots.txt specification that says anything about redirects, so presumably they should be treated the same way as for ordinary web pages (i.e. followed), but there's no guarantee that all the countless robots that might want to crawl your site will get that right.

    (The 1997 Internet Draft does explicitly say that "[o]n server response indicating Redirection (HTTP Status Code 3XX) a robot should follow the redirects until a resource can be found", but since that was never turned into an official standard, there's no real requirement for any crawlers to actually follow it.)

Generally, it would be better to simply configure your web server to return different content for robots.txt depending on the domain it's requested for. For example, using Apache mod_rewrite, you could internally rewrite robots.txt to a domain-specific file like this:

RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_HOST} ^(www\.)?domain(\.com?)?\.([a-z][a-z])$
RewriteCond %{DOCUMENT_ROOT}/robots_%3.txt -f
RewriteRule ^robots\.txt$ robots_%3.txt [NS]

This code, placed in an .htaccess file in the shared document root of the sites, should rewrite any requests for e.g. www.domain.com.ar/robots.txt to the file robots_ar.txt, provided that it exists (that's what the second RewriteCond checks). If the file does not exist, or if the host name doesn't match the regexp, the standard robots.txt file is served by default.

(The host name regexp should be flexible enough to also match URLs without the www. prefix, and to also accept the 2LD co. instead of com. (as in domain.co.uk) or even just a plain ccTLD after domain; if necessary, you can tweak it to accept even more cases. Note that I have not tested this code, so it could have bugs / typos.)
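Since the host-name pattern is easy to get wrong, it can help to exercise it outside Apache first. Here is the same regexp checked with Python's re module (the syntax carries over unchanged; group 3 corresponds to the %3 back-reference):

```python
import re

# The Apache host-name pattern from the RewriteCond above, verified in
# Python for illustration. Group 3 is the two-letter ccTLD (%3).
HOST_RE = re.compile(r"^(www\.)?domain(\.com?)?\.([a-z][a-z])$")

def cctld(host):
    """Return the captured ccTLD, or None if the host doesn't match."""
    m = HOST_RE.match(host)
    return m.group(3) if m else None

print(cctld("www.domain.com.ar"))  # ar
print(cctld("domain.co.uk"))       # uk  (the "co." 2LD case)
print(cctld("domain.cl"))          # cl  (plain ccTLD, no www.)
print(cctld("www.domain.com"))     # None (no trailing ccTLD, so the
                                   #       default robots.txt is served)
```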

Another possibility would be to internally rewrite requests for robots.txt to (e.g.) a PHP script, which can then generate the content of the file dynamically based on the host name and anything else you want. With mod_rewrite, this could be accomplished simply with:

RewriteEngine On
RewriteBase /

RewriteRule ^robots\.txt$ robots.php [NS]

(Writing the actual robots.php script is left as an exercise.)
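For example, the script's logic could look like the following sketch, written in Python for illustration (the actual file would be a PHP script, and the per-country rules shown here are made-up placeholders):

```python
# A sketch of the logic a robots.php-style generator might implement.
# The ccTLD-to-content mapping and the default body are placeholders.
ROBOTS_BY_CCTLD = {
    "ar": "User-agent: *\nDisallow: /private-ar/\n",
    "cl": "User-agent: *\nDisallow: /private-cl/\n",
}
DEFAULT_ROBOTS = "User-agent: *\nDisallow:\n"

def robots_body(host):
    """Pick the robots.txt content based on the host's trailing ccTLD."""
    tld = host.rsplit(".", 1)[-1]
    return ROBOTS_BY_CCTLD.get(tld, DEFAULT_ROBOTS)

print(robots_body("www.domain.com.ar"))  # the "ar" rules
print(robots_body("www.domain.com"))     # the default body
```

A real robots.php would read the host from $_SERVER['HTTP_HOST'], send a text/plain Content-Type header, and echo the chosen body.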

Author: Edgar Quintero

Updated on September 18, 2022

Comments

  • Edgar Quintero
    Edgar Quintero over 1 year

We have a global brand website project for which we are only working on the LATAM portion. The site's installation process allows one website installation to serve several ccTLDs, in order to reduce costs.

    Because of this the robots.txt in www.domain.com/robots.txt is the same file in www.domain.com.ar/robots.txt.

    We would like to implement custom robots.txt files for each LATAM country locale (AR, CO, CL, etc..). One solution we are thinking about is having a redirect placed at www.domain.com.ar/robots.txt to 301 to www.domain.com.ar/directory/robots.txt.

    This way we could have custom robots.txt files for each country locale.

    1. Does this make sense?
    2. Is it possible to redirect a robots.txt file to another robots.txt file?
    3. Any other suggestions?

    Thanks in advance for any input you might have.

    • Akash Panda
      Akash Panda about 10 years
Not sure of the implications this could have; however, could you not use an internal rewrite for this?
    • Edgar Quintero
      Edgar Quintero about 10 years
      Hi Liam, can you please explain what you mean by internal rewrite?
    • Akash Panda
      Akash Panda about 10 years
The answers below already use rewrites via .htaccess, so rather than going over the same explanation, it might be better to use their answers.
  • MrWhite
    MrWhite about 10 years
    FWIW Google should be OK with the redirect. developers.google.com/webmasters/control-crawl-index/docs/…
  • Edgar Quintero
    Edgar Quintero about 10 years
    Yea this is exactly the idea placed in my initial question, thanks for answering!
  • Edgar Quintero
    Edgar Quintero about 10 years
A file cannot be named robots_ar.txt!
  • joosthoek
    joosthoek about 10 years
    @EdgarQuintero: Why on earth could it not be?
  • Edgar Quintero
    Edgar Quintero about 10 years
    Crawlers will always look for the file named as is - robots.txt
  • joosthoek
    joosthoek about 10 years
    @EdgarQuintero: An internal rewrite, as implemented by the rewrite rules I show above, happens entirely within the webserver. A crawler requesting the URL path /robots.txt has no way of even knowing whether the content it receives comes from a file named robots.txt (as usual) or from a file named robots_ar.txt (to which the request was rewritten) or even from a script named robots.php (or even whatever.php).