Can I block search crawlers for every site on an Apache web server?
Solution 1
Create a robots.txt file with the following contents:
User-agent: *
Disallow: /
Put that file somewhere on your staging server; your document root is a good place for it (e.g. /var/www/html/robots.txt).
Add the following to your httpd.conf file:
# Exclude all robots
<Location "/robots.txt">
SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt
The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example.
That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.
(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)
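To make the scope explicit, here is a minimal sketch of how these directives might sit in the main server config (the paths and vhost name are placeholders, not from the answer above): they belong in the global section, outside any <VirtualHost> block, so they apply to every vhost.

```apache
# Global section of httpd.conf -- NOT inside a <VirtualHost> block,
# so every vhost serves the same restrictive robots.txt.
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /var/www/robots-staging/robots.txt

# Individual vhosts need no robots.txt handling of their own;
# the global Alias takes precedence.
<VirtualHost *:80>
    ServerName site1.example.com
    DocumentRoot /var/www/site1
</VirtualHost>
```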
Solution 2
Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
Solution 3
To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.
The only downside is that you now have to type in a username/password the first time you browse to any page on the staging server.
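As a rough sketch of what that global config could look like, assuming Apache 2.4 and a password file at /etc/apache2/.htpasswd (both the path and the username are assumptions for illustration):

```apache
# Create the password file once:
#   htpasswd -c /etc/apache2/.htpasswd staginguser
# Then require authentication for everything, server-wide:
<Location "/">
    AuthType Basic
    AuthName "Staging server"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
</Location>
```

Because the <Location "/"> block sits in the global config rather than in a vhost, every site on the server is covered without touching the individual vhost files.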
Solution 4
Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?).
If you don't have a one-step deployment process in place, this is probably good motivation to get one. There are tons of tools out there for different environments - Capistrano is a pretty good one, favored in the Rails world, but it is by no means the only option.
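The idea can be sketched as a small helper in a deploy script; the file names and the DEPLOY_ENV variable here are assumptions for illustration, not part of any particular tool:

```shell
#!/bin/sh
# Pick the robots.txt that matches the target environment.
# robots.production.txt would allow crawling; every other
# environment (dev/stage/test) gets the disallow-all file.
pick_robots() {
  case "$1" in
    prod) echo "robots.production.txt" ;;
    *)    echo "robots.disallow-all.txt" ;;
  esac
}

# During deployment, something along the lines of:
#   cp "config/$(pick_robots "$DEPLOY_ENV")" "$DOCROOT/robots.txt"
```

Keeping the environment-specific files in version control alongside the database configs means the right robots.txt lands on each server automatically, with nothing to remember per-site.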
Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt.
Nick Messick
Updated on July 09, 2022

Comments
-
Nick Messick almost 2 years
I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed.
Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?
Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either as there is a bunch of sites and I don't want to have to remember to copy over a certain setting if I make a new site.
-
Khuram about 12 years
Saved me a lot of time. Thanks.
-
Dan Bizdadea about 10 years
There is a problem with this approach when you want to expose some APIs to services that don't support HTTP auth. In that case you'll have to disable it for that specific host, which can become a mess over time.
-
nicoX over 9 years
What is the Alias referring to? If I have several vhosts, should I create an Alias for each?
-
jsdalton over 9 years
@nicoX: You do not need to create a separate Alias for each vhost. The one you create here will apply to all vhosts you create.
-
nicoX over 9 years
From the httpd.conf file: we have LoadModule vhost_alias_module modules/mod_vhost_alias.so, and our DocumentRoot is /var/www/html, which is wrong since we are using /var/www/vhosts, although that still works. We include our vhosts with Include pointing to each one's httpd-include.conf file. I included the robots.txt file for each vhost in its root directory, and in httpd.conf I have the Alias of the file pointing to just one of my vhosts.