Can I block search crawlers for every site on an Apache web server?


Solution 1

Create a robots.txt file with the following contents:

User-agent: *
Disallow: /

Put that file somewhere on your staging server; your document root is a great place for it (e.g. /var/www/html/robots.txt).

Add the following to your httpd.conf file:

# Exclude all robots
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt

The SetHandler directive probably isn't required, but you may need it if a handler such as mod_python would otherwise intercept the request.
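For example (illustrative only; the path and the myapp.handler module name are made up), a vhost that routes everything through mod_python would otherwise swallow the request, and the Location block above keeps /robots.txt out of that handler:

# Hypothetical vhost config that hands every request to mod_python;
# myapp.handler is a placeholder module name.
<Directory "/var/www/site1">
    SetHandler mod_python
    PythonHandler myapp.handler
</Directory>

# With that in place, the <Location "/robots.txt"> SetHandler None block
# shown above resets the handler so /robots.txt is served as a plain file.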

That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.

(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)

Solution 2

Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
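Something like this, as a rough sketch (untested; the hostname and paths are placeholders, and the access grant uses Apache 2.4 syntax):

# Shared, restrictive robots.txt kept outside the code base. It contains:
#   User-agent: *
#   Disallow: /
<VirtualHost *:80>
    ServerName staging.example.com
    DocumentRoot /var/www/staging/site1

    # Serve the shared file instead of whatever ships with the code base
    Alias /robots.txt /var/www/robots-staging/robots.txt
</VirtualHost>

# Allow the aliased file to be read, since it lives outside the DocumentRoot
<Directory "/var/www/robots-staging">
    Require all granted
</Directory>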

Solution 3

To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config with a simple .htpasswd file.

The only downside is that you then have to type a username/password the first time you browse to any page on the staging server.
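A minimal sketch of that global config, assuming Apache 2.4 with the standard auth modules loaded; the password file path and realm name are placeholders:

# In the global server config so it applies to every vhost.
# Create the password file first with:
#   htpasswd -c /etc/apache2/.htpasswd-staging someuser
<Location "/">
    AuthType Basic
    AuthName "Staging server"
    AuthUserFile /etc/apache2/.htpasswd-staging
    Require valid-user
</Location>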

Solution 4

Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?).

If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments. Capistrano is a pretty good one, favored in the Rails/Django world, but it's by no means the only one.

Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt.
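One way to combine the two ideas is a conditional block in the shared config; this is only a sketch, assuming the staging server's httpd is started with -D STAGING (how you pass that flag depends on your init/service setup) and that the file path below, which is made up, is somewhere Apache is already allowed to serve from:

# Only the staging server is started with -D STAGING, so production
# never picks up this Alias. robots-disallow-all.txt contains:
#   User-agent: *
#   Disallow: /
<IfDefine STAGING>
    Alias /robots.txt /var/www/robots-disallow-all.txt
</IfDefine>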

Author: Nick Messick

Updated on July 09, 2022

Comments

  • Nick Messick almost 2 years

    I have somewhat of a staging server on the public internet running copies of the production code for a few websites, and I'd really rather the staging sites not get indexed.

    Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?

    Changing the robots.txt wouldn't really work, since I use scripts to copy the same code base to both servers. I would also rather not change the virtual host conf files, as there are a bunch of sites and I don't want to have to remember to copy over a certain setting whenever I make a new site.

  • Khuram about 12 years
    saved me a lot of time. Thnx.
  • Dan Bizdadea about 10 years
    There is a problem with this approach when you want to expose some APIs to services that don't support HTTP auth. In that case you'll have to disable it for that specific host, which can turn into a mess over time.
  • nicoX over 9 years
    What is the Alias referring to? If I have several vhosts, should I create an Alias for each?
  • jsdalton over 9 years
    @nicoX: You do not need to create a separate Alias for each vhost. The one you create here will apply to all vhosts you create.
  • nicoX over 9 years
    From the httpd.conf file: we have LoadModule vhost_alias_module modules/mod_vhost_alias.so and a DocumentRoot of /var/www/html, which is wrong since we are using /var/www/vhosts, although that still works. We include our vhosts with Include and the path to each one's httpd-include.conf file. I included a robots.txt file for each vhost in its root directory, and in httpd.conf I have the Alias of the file pointing to just one of my vhosts.