How can I use robots.txt to disallow a subdomain only?
Solution 1
You can serve a different robots.txt file based on the subdomain through which the site has been accessed. One way of doing this on Apache is by internally rewriting the URL using mod_rewrite in .htaccess. Something like:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www\.)?example\.com$ [NC]
RewriteRule ^robots\.txt$ robots-disallow.txt [L]
The above states that for all requests to robots.txt where the host is anything other than www.example.com or example.com, the request is internally rewritten to robots-disallow.txt. And robots-disallow.txt will then contain the Disallow: / directive.
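For completeness, robots-disallow.txt can contain the standard blanket-disallow rules (the file name matches the rewrite rule above; the wildcard User-agent line applies it to all crawlers):

```
User-agent: *
Disallow: /
```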
If you have other directives in your .htaccess file then this rule will need to be nearer the top, before any routing directives.
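If robots.txt is generated dynamically rather than rewritten by Apache, the same host-based switch can be done in server-side code instead. A minimal Python sketch, assuming the allowed hostnames below (all names and the function are illustrative, not part of any framework):

```python
# Minimal sketch: choose a robots.txt body based on the Host header
# of the incoming request. Hostnames are illustrative.
ALLOWED_HOSTS = {"example.com", "www.example.com"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"       # permissive robots.txt
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"  # block all crawling

def robots_txt_for(host: str) -> str:
    """Return the robots.txt body appropriate for the requesting host."""
    # Strip an optional :port suffix and normalize case before comparing.
    hostname = host.split(":", 1)[0].lower()
    return ALLOW_ALL if hostname in ALLOWED_HOSTS else DISALLOW_ALL
```

A request handler for /robots.txt would simply return `robots_txt_for(request_host)` with a text/plain content type, so the shared code base serves different content per subdomain without separate files.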
Solution 2
robots.txt works only if it is present in the root of the host. You need to upload a separate robots.txt for each subdomain website, where it can be accessed from http://subdomain.example.com/robots.txt.
Add the code below to that robots.txt:
User-agent: *
Disallow: /
Another way is to insert a robots <META> tag in all pages:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
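If editing every page is impractical, the equivalent noindex signal can be sent as an X-Robots-Tag HTTP response header from .htaccess instead. A sketch assuming mod_setenvif and mod_headers are enabled, using the same example hostnames as above:

```
# Mark requests whose Host is (www.)example.com as allowed
SetEnvIfNoCase Host "^(www\.)?example\.com$" allowed_host
# All other hosts (i.e. other subdomains) get a noindex header
Header set X-Robots-Tag "noindex, nofollow" env=!allowed_host
```

Note that, like the <META> tag, this prevents indexing rather than crawling.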
alexus
Updated on September 18, 2022

Comments
-
alexus over 1 year
My code base is shared between several environments (live, staging, dev) & sub-domains (staging.example, dev.example, etc.) and only two should be allowed to be crawled (ie. www.example and example). Normally I'd modify /robots.txt and add Disallow: /, but due to the shared code base I cannot modify /robots.txt without affecting all (sub)domains. Any ideas how to go about it?
-
alexus over 7 years: I was thinking of the same solution. I wasn't sure if there was something else out there, but at the end of the day, if that is it, then that is what will get my job done.
-
MrWhite over 7 years: If both subdomains/hosts point to the very same webspace/code base then there's nothing in the robots.txt "standard" that can control this, if that is what you are suggesting. The bot is simply going to request sub.example.com/robots.txt, so you would need to do something to serve a different response depending on the subdomain. You don't need to use mod_rewrite, but it is a technique I've seen used several times. If robots.txt is dynamically generated then you could change the response in the server-side code (eg. PHP).
-
MrWhite over 7 years: An alternative to using robots.txt might be to prevent indexing, rather than crawling, by sending an X-Robots-Tag: noindex HTTP response header when such subdomains are accessed (which could also be done in .htaccess). Although I think preventing crawling is probably preferable. (?)
-
MrWhite over 7 years: But the OP already states: "Normally I'd modify /robots.txt and add Disallow: /, but due to shared code base I cannot modify /robots.txt without affecting all (sub)domains."
-
MrWhite about 6 years: Although this does not prevent crawling, which would seem to be the OP's requirement.