How can I use robots.txt to disallow a subdomain only?


Solution 1

You can serve a different robots.txt file based on the subdomain through which the site has been accessed. One way of doing this on Apache is by internally rewriting the URL using mod_rewrite in .htaccess. Something like:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www\.)?example\.com$ [NC]
RewriteRule ^robots\.txt$ robots-disallow.txt [L]

The above states that for all requests to robots.txt where the host is anything other than www.example.com or example.com, the request is internally rewritten to robots-disallow.txt. robots-disallow.txt will then contain the Disallow: / directive.
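
For completeness, a minimal robots-disallow.txt (the filename is arbitrary, as long as it matches the RewriteRule above) that blocks all crawlers would look like this:

User-agent: *
Disallow: /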

If you have other directives in your .htaccess file, then this rule will need to be nearer the top, before any routing directives.

Solution 2

robots.txt works only if it is present in the root of the host.

You need to upload a separate robots.txt for each subdomain's site, so that it can be accessed at http://subdomain.example.com/robots.txt.

Add the code below to that robots.txt:

User-agent: *
Disallow: /

Alternatively, you can insert a robots <META> tag into all pages. Note that this prevents indexing, not crawling.

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Comments

  • alexus
    alexus over 1 year

    My code base is shared between several environments (live, staging, dev) & sub-domains (staging.example, dev.example, etc.) and only two should be allowed to be crawled (i.e. www.example and example). Normally I'd modify /robots.txt and add Disallow: /, but due to the shared code base I cannot modify /robots.txt without affecting all (sub)domains.

    Any ideas how to go about it?

  • alexus
    alexus over 7 years
    I was thinking of the same solution. I wasn't sure if there was something else out there, but at the end of the day, if that's it, then that's what will get my job done)
  • MrWhite
    MrWhite over 7 years
    If both subdomains/hosts point to the very same webspace/code base then there's nothing in the robots.txt "standard" that can control this, if that is what you are suggesting. The bot is simply going to request sub.example.com/robots.txt, so you would need to do something to serve a different response depending on the subdomain. You don't need to use mod_rewrite, but it is a technique I've seen used several times. If robots.txt is dynamically generated then you could change the response in the server-side code (e.g. PHP).
  • MrWhite
    MrWhite over 7 years
    An alternative to using robots.txt might be to prevent indexing, rather than crawling, by sending an X-Robots-Tag: noindex HTTP response header when such subdomains are accessed (which could also be done in .htaccess; see the sketch after these comments). Although I think preventing crawling is probably preferable. (?)
  • MrWhite
    MrWhite over 7 years
    But the OP already states: "Normally I'd modify /robots.txt and add Disallow: /, but due to shared code base I cannot modify /robots.txt without affecting all (sub)domains."
  • MrWhite
    MrWhite about 6 years
    Although this does not prevent crawling, which would seem to be the OP's requirement.
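
For reference, here is a rough, untested .htaccess sketch of the X-Robots-Tag approach MrWhite describes above. It assumes Apache with mod_setenvif and mod_headers enabled, uses example.com as a placeholder host, and the environment variable name INDEXABLE_HOST is arbitrary:

# Flag requests whose Host header is www.example.com or example.com
SetEnvIfNoCase Host "^(www\.)?example\.com$" INDEXABLE_HOST

# Send the noindex header for every other (sub)domain
Header set X-Robots-Tag "noindex" env=!INDEXABLE_HOST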