How to stop certain URLs from being indexed


Solution 1

There are 2 main ways to prevent search engines from indexing specific pages:

  1. A Robots.txt file for your domain.
  2. The Meta Robots tag on each page.

Robots.txt should be your first stop for URL patterns that match several files. You can see the syntax here, and in more detail here. The robots.txt file must be placed in the root folder of your domain, i.e. at http://www.yourdomain.com/robots.txt, and it would contain something like:

User-agent: *
Disallow: /path/with-trailing-slash/
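
As a quick sanity check of the rule above -- this is my own sketch, not part of the original answer -- Python's standard-library robots.txt parser can tell you whether a given URL would be blocked. The domain and paths are the placeholders from the example:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt for the placeholder domain.
rp = RobotFileParser("http://www.yourdomain.com/robots.txt")
rp.read()

# "*" matches the generic User-agent rule shown above.
print(rp.can_fetch("*", "http://www.yourdomain.com/path/with-trailing-slash/page.html"))  # expect False
print(rp.can_fetch("*", "http://www.yourdomain.com/some/other/page.html"))                # expect True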


The Meta Robots tag is more flexible and capable, but must be inserted in every page you want to affect.

Again, Google has an overview of how to use Meta Robots, and of how to get pages removed from their index via Webmaster Tools. Wikipedia has more comprehensive documentation on Meta Robots, including the search-engine-specific variants.

If you want to prohibit Google, The Web Archive and other search engines from keeping a copy of your webpage, then you want the following tag (shown in HTML4 format):

<meta name="robots" content="noarchive">

To prevent indexing and keeping a copy:

<meta name="robots" content="noindex, noarchive">

And to prevent both of the above, as well as prevent search engines from using links on the page to find more pages to index:

<meta name="robots" content="noindex, nofollow, noarchive">
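
To confirm a page is actually serving the tag you intend, here is a small verification sketch (mine, not part of the answer) that fetches the page with the standard library and prints any robots meta directives it declares; the URL below is just a placeholder:

from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    """Collects the content of any <meta name="robots"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

finder = RobotsMetaFinder()
with urlopen("http://www.yourdomain.com/private/page.html") as resp:
    finder.feed(resp.read().decode("utf-8", errors="replace"))

print(finder.directives)  # e.g. ['noindex, nofollow, noarchive']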

NB 1: All three of the above meta tags are for search engines alone -- they do not affect HTTP proxies or browsers.

NB 2: If you already have pages indexed and archived, and you block them via robots.txt while at the same time adding the meta tag to the same pages, then robots.txt will prevent search engines from ever seeing the updated meta tag, and the stale copy can remain in the index.

Solution 2

There's actually a third way to prevent Google and other search engines from indexing URLs: the X-Robots-Tag HTTP response header. This is better than meta tags because it works for all document types, and you can have more than one tag.

The REP META tags give you useful control over how each webpage on your site is indexed. But they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files? Well, now the same flexibility for specifying per-URL tags is available for all other file types.

We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP header used to serve the file. Here are some illustrative examples:

Don't display a cache link or snippet for this item in the Google search results:

X-Robots-Tag: noarchive, nosnippet

Don't include this document in the Google search results:

X-Robots-Tag: noindex

Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:

X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT

You can combine multiple directives in the same document. For example, to not show a cached link for this document and remove it from the index after 23rd July 2007, 3pm PST:

X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23 Jul 2007 15:00:00 PST
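
The answer doesn't show where the header actually gets set, so here is a minimal sketch of one way to attach an X-Robots-Tag header to every response, using only Python's standard library. In practice you would more likely configure this in your web server or application framework; the port and header value below are purely illustrative assumptions:

from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class NoIndexHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Ask search engines not to index or archive anything served here.
        self.send_header("X-Robots-Tag", "noindex, noarchive")
        super().end_headers()

if __name__ == "__main__":
    # Serve the current directory on port 8000 with the header applied.
    ThreadingHTTPServer(("", 8000), NoIndexHandler).serve_forever()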

Solution 3

If your goal is for these pages not to be seen by the public, it's best to put a password on this set of pages, and/or have some configuration that only allows specific, whitelisted addresses to access the site (this can be done at the server level, likely via your host or server admin).

If your goal is to have these pages exist, just not be indexed by Google or other search engines, then as others have mentioned you do have a few options. But I think it's important to distinguish between the two main functions of Google Search in this sense: crawling and indexing.

Crawling vs. Indexing

Google crawls your site, and Google indexes your site. The crawlers find the pages of your site; the indexing is the organizing of the pages of your site. There's a bit more information on this here.

This distinction is important when trying to block or remove pages from Google's "Index". Many people default to just blocking via robots.txt, which is a directive telling Google what (or what not) to crawl. It's often assumed that if Google doesn't crawl your site, it's unlikely to index it. However, it's extremely common to see pages that are blocked by robots.txt indexed in Google anyway.


Directives to Google & Search Engines

These types of "directives" are merely recommendations to Google on which parts of your site to crawl and index; search engines are not required to follow them. This is important to know. I've seen many devs over the years think that they can just block a site via robots.txt, only to find the site indexed in Google a few weeks later. If someone else links to the site, or if one of Google's crawlers somehow gets a hold of it, it can still be indexed.

Recently, with GSC (Google Search Console)'s updated dashboard, there is a report called the "Index Coverage Report." New data is available to webmasters here that hasn't been directly available before: specific details on how Google handles a certain set of pages. I've seen and heard of many websites receiving warnings labeled "Indexed, but blocked by robots.txt."

Google's latest documentation mentions that if you want pages out of the index, you should add noindex, nofollow tags to them.


Remove URLs Tool

Just to build on what some others have mentioned about the "Remove URLs Tool"...

If the pages are indexed already, and it's urgent to get them out, Google's "Remove URLs Tool" will allow you to "temporarily" block pages from search results. The request lasts 90 days, but I've used it to get pages removed from Google more quickly than by using noindex, nofollow alone, kind of like an extra layer.

Using the "Remove URLs Tool," Google still will crawl the page, and possibly cache it, but while you're using this feature, you can add the noindex nofollow tags, so it sees them, and by the time the 90 days are up, it'll hopefully know not to index your page anymore.


IMPORTANT: Using both robots.txt and noindex, nofollow tags sends somewhat conflicting signals to Google.

The reason is, if you tell Google not to crawl a page, and then you have noindex, nofollow on that page, it may never crawl the page to see the noindex, nofollow tag. The page can then be indexed through some other method (a link from elsewhere, or whatnot). The details on why this happens are rather vague, but I've seen it happen.


In short, in my opinion, the best way to stop specific URLs from being indexed is to add a noindex, nofollow tag to those pages. With that, make sure that you're not also blocking those URLs with robots.txt, as that could prevent Google from properly seeing those tags. You can leverage the Remove URLs from Google tool to temporarily hide them from search results while Google processes your noindex, nofollow.
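
As a rough way to check for the conflicting-signals problem described above (my own sketch with placeholder URLs, not part of the original answer), you can confirm that robots.txt still lets Googlebot crawl the page and that the page actually carries a noindex directive:

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

url = "https://example.com/private/page.html"  # placeholder page

# 1) Is the URL still crawlable according to robots.txt?
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
crawlable = rp.can_fetch("Googlebot", url)

# 2) Does the page carry a noindex directive (header check plus a crude body check)?
with urlopen(url) as resp:
    header = resp.headers.get("X-Robots-Tag", "") or ""
    body = resp.read().decode("utf-8", errors="replace").lower()

noindex = "noindex" in header.lower() or ('name="robots"' in body and "noindex" in body)

# For the noindex to be honored, you want crawlable=True and noindex=True.
print(f"crawlable by Googlebot: {crawlable}, noindex present: {noindex}")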

Solution 4

Yes, that will fix the problem. To prevent content from showing up in Google's index you can either use robots.txt or the HTML meta tag

<meta name="robots" content="noindex, nofollow" />

The next time your site is crawled, this will make your content drop out of the Google index.

You can also use the noarchive value – this will block caching of your page. This is Google-specific:

<meta name="robots" content="noarchive" />

You can use the ‘removal tool’ in Google's Webmaster Tools to request a very urgent removal of your content. Note that you should block indexing of your content first (using either robots.txt or the meta robots tag).




Comments

  • Simon Hayter
    Simon Hayter over 1 year

When I type site:example.com (using my own domain, obviously), I get several link errors showing up in the listing. Typically, they are of the form: /some/fixed/path/admin/unblockUser/11

    I am thinking of adding the following line to my robots.txt file:

    Disallow: /some/fixed/path/admin/*
    
  • Jesper M
    Jesper M over 13 years
    Downvoted? Why on earth was this downvoted? Please leave a comment if you down-vote so the answer can be improved.
  • mawtex
    mawtex over 13 years
    @Jesper Mortensen Your initial answer did not address the caching question at all. Your edit fixed this and made the noindex info much better. +1 now ;-)
  • mawtex
    mawtex over 13 years
    The 'X-Robots_tag header' link is broken.
  • John Conde
    John Conde over 13 years
    Thanks for the heads up. Chrome seems to have issues with the formatting toolbar and it added extra text to the link.
  • John Mueller
    John Mueller over 13 years
    Also, while not necessarily incorrect, a robots meta tag with "noindex, noarchive" is equivalent to "noindex" (when a URL is not indexed, it's not archived/cached either).
  • John Mueller
    John Mueller over 13 years
Finally (sorry for adding so many comments :-)), in this particular case (admin pages), I would just make sure that the URLs return 403 when not logged in. That also prevents search engines from indexing them and is theoretically clearer than having a page return 200 + using a noindex robots meta tag. The end result is the same in the search results, but using the proper HTTP result code can help you recognize unauthorized admin accesses in your logs more easily.
  • Jesper M
    Jesper M over 13 years
@John Mueller: Good and important points all. Just to clarify, "noindex, noarchive" may be equivalent to "noindex" with current SEs, but semantically they're different -- The Web Archive could be a use case where "noarchive" does not follow from "noindex". Great point about using HTTP's semantics, real access control (HTTPS, login required) and possibly also Cache-Control: no-store to truly exclude pages from search engines and HTTP proxies. Actually, I didn't fully see the "admin pages" part in the OP's question -- I wonder why this info is public at all.