Hide a site entirely from search engines (Google, Bing etc.)

Solution 1

Any method that relies on the crawler's good behaviour may fail, so the best option is to use the strongest authority available: the web server itself. If you have access to the main web server configuration, or at least to the .htaccess file, you should use a method that involves those elements.

The best way is to use HTTP authentication, but if you really don't want to use that, you still have another option.

If you know your clients' IPs, you can restrict access in your .htaccess with a simple access-control block like this:

Order deny,allow
Deny from all
Allow from x.x.x.x
Allow from y.y.y.y

An IP can also be given in the form x.x.x instead of x.x.x.x, which allows the whole block whose last octet is missing.
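Note that on Apache 2.4 or later the Order/Deny/Allow directives are deprecated (they only work through mod_access_compat); the equivalent with the newer mod_authz_core syntax would be something like this (the IPs are placeholders):

```apache
# Apache 2.4+ equivalent of the Deny/Allow block above
<RequireAny>
    Require ip 203.0.113.10
    Require ip 198.51.100.0/24
</RequireAny>
```

Everything not matched by a Require line is refused with a 403.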

You can combine that with the HTTP response code. A 403 tells the bot not to go there; bots usually retry a few times just in case, but it should work quickly when combined with the Deny directive.

You can use the HTTP response code even if you don't know your client's IPs.

Another option is to redirect the request to the home page with, for instance, a 301 status code, although I wouldn't recommend this method. Even though it would work, you would not be telling the truth about the resource and what happened to it, so it's not a precise approach.

Update considering your comment

You can use the list of known crawler user-agent strings to block them in your .htaccess; a simple rule like this would do what you want:

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yahoo|yandex) [NC]
RewriteRule .* - [F,L]

Just add the most common ones, or the ones that have already visited your site.
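If you also want those crawlers to still be able to read your robots.txt (so they see your Disallow rules instead of a bare 403), you could exempt that one file. A sketch, with the bot list as an illustrative sample:

```apache
RewriteEngine On

# Let every client fetch robots.txt, even matched bots
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Case-insensitive match on common crawler user agents (sample list)
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|duckduckbot|yandex|baiduspider) [NC]
# Return 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]
```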

Solution 2

Use Header set X-Robots-Tag "noindex". This prevents pages from being in a search engine's index.

In Apache you could put this in your conf file or .htaccess file in your root directory:

Header set X-Robots-Tag "noindex"
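If you only want to keep certain resources out of the index (non-HTML files such as PDFs, which cannot carry a robots meta tag), the header can be scoped with a FilesMatch block; a sketch (requires mod_headers):

```apache
# Send X-Robots-Tag only for PDF and Word documents
<FilesMatch "\.(pdf|docx?)$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```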

Solution 3

This happens when Google or Bing discovers your site but has not been told not to index it. A robots.txt that disallows crawling keeps the search engine from reading your pages, but a link or redirect to the site can still get it listed. Disallowing crawling is not the same as telling a search engine not to index the site.

Put <meta name="robots" content="noindex"> in the header of the HTML of all pages (preferably), or at least the home page, and search engines should remove your site from the index in time. It normally takes 30-60 days (for Google) but may take longer, or less than 30 days; it all depends on how fast the search engine revisits your site and on the processing within the search engine. I just wanted to warn you that it may take some time.
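For clarity, the tag goes inside the document's head; a minimal sketch:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Internal site</title>
    <!-- Ask all crawlers not to index this page -->
    <meta name="robots" content="noindex">
</head>
<body>...</body>
</html>
```

Remember that the crawler must be allowed to fetch the page to see this tag, so the robots.txt Disallow has to go.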

For now, there is no harm except that others may discover your site. If you want to limit visitation, then perhaps another mechanism is needed. I understand wanting to keep it open and not require an account. As of right now, I am not sure I have advice on limiting visitation. But also understand that rogue spiders will also discover your site and may create links regardless of your wishes. Think about how you may control access if and when this happens - and if control is important to you.

Updated on September 18, 2022

Comments

  • Kristian
    Kristian over 1 year

    My company is running a few internal websites that we do not want indexed by search engines such as Google, Bing etc.

    However, the websites still need to be accessible for our customers, and therefore, I do not wish to use HTTP password protection.

    Obviously, I already have a robots.txt containing:

    User-agent: *
    Disallow: /
    

    When I search for the domain name, it still shows up, and Google says: "A description for this result is not available because of this site's robots.txt", while Bing says "We would like to show you a description here but the site won’t allow us.".

    How can I ensure that the websites are totally hidden in the search results?

    • MrWhite
      MrWhite about 10 years
      "Obviously, I already have a robots.txt" - it's actually the robots.txt file which is responsible for the message in SERPs. This should be removed to allow the pages to be crawled, but follow the advice given in the answers to prevent the pages from being indexed and showing in the SERPs at all.
  • MrWhite
    MrWhite about 10 years
    In addition... the Disallow entry in robots.txt should be removed.
  • Kristian
    Kristian about 10 years
    I am using this, which should have the effect that I want: `Header set X-Robots-Tag "noindex, nofollow"` (according to Google's documentation). It may take Google a while to re-crawl my site, but I will check back and award points to the working solution.
  • Kristian
    Kristian about 10 years
    The website needs to be available for users from the entire world, but should just not be possible to find using search engines. We are using password-protection, just not HTTP password protection.
  • PatomaS
    PatomaS about 10 years
    @Kristian: considering your comment, I added one more option.