Preventing robots from crawling a specific part of a page

Solution 1

Here is the same answer I provided to the question "noindex tag for google" on Stack Overflow:

You can prevent Google from seeing portions of the page by putting those portions in iframes that are blocked by robots.txt.

robots.txt

Disallow: /nocrawl/

index.html

This text is crawlable, but the following is
text that search engines can't see:
<iframe src="/nocrawl/content.html" width="100%" height="300" scrolling="no"></iframe>

/nocrawl/content.html

Search engines cannot see this text.

Instead of using iframes, you could load the contents of the hidden file using AJAX. Here is an example that uses jQuery's $.get to do so:

This text is crawlable, but the following is
text that search engines can't see:
<div id="hidden"></div>
<script>
    $.get(
        "/nocrawl/content.html",
        function(data){ $('#hidden').html(data); }
    );
</script>
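
The same idea works without jQuery. Here is a minimal sketch using the browser's Fetch API, assuming the same /nocrawl/content.html path blocked by robots.txt as in the example above:

<div id="hidden"></div>
<script>
    // Fetch the robots.txt-blocked file and inject it into the page.
    // Crawlers that honor robots.txt cannot fetch /nocrawl/content.html,
    // so the injected text stays out of the index.
    fetch("/nocrawl/content.html")
        .then(function(response) { return response.text(); })
        .then(function(html) {
            document.getElementById("hidden").innerHTML = html;
        });
</script>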

Solution 2

Another solution is to wrap the sig in a span or div with style set to display:none, and then use JavaScript to remove that style so the text displays for browsers with JavaScript on. Search engines know the text is not going to be displayed, so they shouldn't index it.

This bit of HTML, CSS and JavaScript should do it:

HTML:

<span class="sig">signature goes here</span>

CSS:

.sig {
display:none;
}

JavaScript:

<script type="text/javascript">
$(document).ready(function() {
    $(".sig").show();
});
</script>

You'll need to include the jQuery library.
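
If you'd rather not depend on jQuery, a plain-JavaScript sketch of the same trick (using the .sig class from the example above) would be:

<script>
    // Reveal the hidden signatures once the DOM is parsed; no jQuery needed.
    document.addEventListener("DOMContentLoaded", function() {
        document.querySelectorAll(".sig").forEach(function(el) {
            el.style.display = "inline"; // or "block", to match the element
        });
    });
</script>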

Solution 3

I had a similar problem. I solved it with CSS, but it can be done with JavaScript and jQuery too.

1 - I created a class that I call "disallowed-for-crawlers" and placed that class on everything that I did not want the Google bot to see, or wrapped the content in a span with that class.

2 - In the main CSS of the page, I have something like:

.disallowed-for-crawlers {
    display:none;
}

3 - Create a CSS file called disallow.css and disallow it in robots.txt so crawlers won't access that file, but reference it in your page after the main CSS.

4 - In disallow.css, I placed the code:

.disallowed-for-crawlers {
    display:block !important;
}
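
Putting steps 1-4 together, the robots.txt entry and the stylesheet order might look like this (the /css/ paths are just assumptions; use your own):

robots.txt:

User-agent: *
Disallow: /css/disallow.css

In the page's head, main.css (which hides .disallowed-for-crawlers) comes first, and the blocked disallow.css (which shows it again) comes second, so only human visitors get the override:

<link rel="stylesheet" href="/css/main.css">
<link rel="stylesheet" href="/css/disallow.css">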

You can play with JavaScript or CSS; I just took advantage of the robots.txt disallow and the CSS classes. :) Hope it helps someone.

Solution 4

One way to do this is to use an image of text rather than plain text.

It is possible that Google will eventually be smart enough to read the text out of the image, so it might not be completely future-proof, but it should work well for a while.

There are a number of disadvantages to this approach. It's bad for visually impaired users, and it prevents your content from adapting to mobile devices versus desktop computers (and so on).

But it is a method that currently (somewhat) works.

Solution 5

To mark every part of the webpage that contains a signature as non-crawlable, you can use the nosnippet directive, which Google's documentation describes as follows:

Do not show a text snippet or video preview in the search results for this page. A static image thumbnail (if available) may still be visible, when it results in a better user experience. This applies to all forms of search results (at Google: web search, Google Images, Discover).

If you don't specify this directive, Google may generate a text snippet and video preview based on information found on the page.

E.g.:

<p>This text can be shown in a snippet
<span data-nosnippet>and this part would not be shown</span>.</p>

<div data-nosnippet>not in snippet</div>
<div data-nosnippet="true">also not in snippet</div>
<div data-nosnippet="false">also not in snippet</div>
<!-- all values are ignored -->

<div data-nosnippet>some text</html>
<!-- unclosed "div" will include all content afterwards -->

<mytag data-nosnippet>some text</mytag>
<!-- NOT VALID: not a span, div, or section -->

For an understanding of Google's hidden text policy, check the Google Guide Hidden text and links.

Author: WebbyTheWebbor
Updated on September 18, 2022

Comments

  • WebbyTheWebbor
    WebbyTheWebbor almost 2 years

As a webmaster in charge of a tiny site that has a forum, I regularly receive complaints from users that both the internal search engine and external searches (like when using Google) are totally polluted by my users' signatures (they're using long signatures, and that's part of the forum's experience because signatures make a lot of sense in my forum).

    So basically I'm seeing two options as of now:

1. Rendering the signature as a picture: when a user clicks on the "signature picture", they get taken to a page that contains the real signature (with the links in the signature etc.), and that page is set as being non-crawlable by search engine spiders. This would consume some bandwidth and need some work (because I'd need an HTML renderer producing the picture etc.) but obviously it would solve the issue (there are tiny gotchas in that the signature wouldn't respect the font/color scheme of the users, but my users are very creative with their signatures anyway, using custom fonts/colors/sizes etc., so it's not that much of an issue).

    2. Marking every part of the webpage that contains a signature as being non-crawlable.

However I'm not sure about the latter: is this something that can be done? Can you just mark specific parts of a webpage as being non-crawlable?

  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
    +1 and I thought about it but wouldn't that be considered a form of "cloaking" by various spiders?
  • paulmorriss
    paulmorriss almost 13 years
  • paulmorriss
    paulmorriss almost 13 years
    I think it's quite neat :-)
  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
the problem is not so much Google showing users' sigs in their snippets as these specific pages getting so highly ranked in Google in the first place. The issue here is precisely that Google may think the sigs are relevant when they're actually not: I mean, that's exactly what my question is all about.
  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
    @ʍǝɥʇɐɯ: serving different content depending on who is accessing the page is kinda frowned upon and may penalize you in search engine as far as I understand it. I much prefer paulmorris' JavaScript solution.
  • DisgruntledGoat
    DisgruntledGoat almost 13 years
    @Webby, I don't understand, why don't you want your pages ranking highly? Do you have some example pages and queries so we can see what you're talking about? And if Google is showing a sig in search results, then it is relevant for that search query, even if it's not relevant to the page itself.
  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
    @ʍǝɥʇɐɯ: erf, if serving personalized content is the name of the game, so is JavaScript. Last I checked the Web overall didn't really work that well anymore without JavaScript installed (GMail, FaceBook, Google Docs, stack overflow, Google+ --yup I've got it already ;) -- etc.). I don't see no need to criticize paulmorris' solution based on the false premise that JavaScript being not available would be an issue.
  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
    I can't give examples but I do want my site/forum to rank highly and it does so very nicely. The problem is that amongst the search results (which are all mostly for my site/forum anyway because it's basically the site on the subject), what should be the real entry pages are flooded amongst signatures. I mean, I do really want to do what I asked in the question. And pictures or JavaScript it is going to be.
  • WebbyTheWebbor
    WebbyTheWebbor almost 13 years
    @ʍǝɥʇɐɯ: You may like this from Matt Cutts (in charge of SEO at Google) on that very subject: theseonewsblog.com/3383/google-hidden-text That was the excellent comment by paulmorris posted in comment to his excellent answer. I'm sorry but calling JavaScript "sillyness" on such a forum is close to trolling.
  • ʍǝɥʇɐɯ
    ʍǝɥʇɐɯ almost 13 years
    ...and then we get this question: webmasters.stackexchange.com/questions/16398/… - 'keyword stuffing' is silly. Sorry about that.
  • DisgruntledGoat
    DisgruntledGoat almost 13 years
    @Webby, your responses have been a little confusing but you seem to be implying that your user signatures are all separate pages (URLs) and thus appearing as separate results in SERPs. In which case you can block those pages through robots.txt. Otherwise, try the meta description solution I posted above, because that will almost certainly mitigate the problem.
  • InanisAtheos
    InanisAtheos over 11 years
    This could, in the strictest definition, be considered cloaking. However he could print all of the signature with javascript using a document.write("");. Google does not index anything within javascript. support.google.com/customsearch/bin/…
  • wrygiel
    wrygiel almost 11 years
    I believe Google could index such paragraphs, even if they are hidden using CSS. The safest option is to not include the text in the HTML at all. (We can use JavaScript to inject the text at runtime.)
  • Jayen
    Jayen over 8 years
How well does this work if you use alt & title tags appropriately?
  • Stephen Ostermiller
    Stephen Ostermiller over 8 years
    No. Googleoff and Googleon are only supported by the Google Search Appliance. Googlebot ignores them for web search. Reference: Can you use googleon and googleoff comments to prevent Googlebot from indexing part of a page? You linked to the Google Search Appliance documentation and a comment on the article you linked to also says that it doesn't work for Googlebot.
  • Luke Madhanga
    Luke Madhanga over 8 years
    @StephenOstermiller oh right! Darn
  • James
    James over 8 years
    Haven't tried, but it seems likely that Google would crawl those. It's a major limitation of this approach.
  • Cristol.GdM
    Cristol.GdM over 7 years
    Google robots have become much better at reading JavaScript since 2011. Is this answer still valid? Wouldn't the current robots be able to read the text?
  • John Conde
    John Conde about 7 years
This answer assumes the website uses PHP, or that the developer knows PHP, which may not be true. Also, it makes getting to the content difficult for users, which is not a good thing.
  • Alfons Marklén
    Alfons Marklén about 7 years
I can buy that not everyone knows PHP, but a captcha can be "what is the color of grass"; even blind people know that.
  • Pranav Bilurkar
    Pranav Bilurkar almost 7 years
Will adding/injecting content using AJAX help to disallow and prevent it from being crawled?
  • Stephen Ostermiller
    Stephen Ostermiller almost 7 years
    As long as the location the AJAX is fetching from is blocked by robots.txt.
  • Pranav Bilurkar
    Pranav Bilurkar almost 7 years
    Will you please check this webmasters.stackexchange.com/questions/108169/… and suggest if any.
  • Pranav Bilurkar
    Pranav Bilurkar almost 7 years
    As long as the location the AJAX is fetching from is blocked by robots.txt - Please elaborate on this.
  • Stephen Ostermiller
    Stephen Ostermiller almost 7 years
As in the above example, robots.txt contains Disallow: /nocrawl/, which prevents Googlebot from fetching the AJAX content.
  • Mac Convery
    Mac Convery over 6 years
    Google penalises those who hide their javascript from being crawled, in order to prevent abuse. Is the same true of iframes?
  • Stephen Ostermiller
    Stephen Ostermiller over 6 years
    @Jonathan that is a good question, you should ask it using the Ask Question link.
  • Σπύρος Γούλας
    Σπύρος Γούλας over 5 years
    I believe this falls under "cloaking" and therefore it is not a good practice.
  • Σπύρος Γούλας
    Σπύρος Γούλας over 5 years
I am not sure this works due to crawlers not accessing the .css file (is this a thing? Since when do crawlers access and crawl specific CSS files?) rather than simply due to display:none and crawlers understanding it will not be displayed, so they don't index it. Even if this is the case, what do you do to actually display the content to human users?
  • Rolando Retana
    Rolando Retana over 5 years
The content is displayed when the file from step 4 (disallow.css) is loaded for the human user, since they are allowed to see that file. As for robots loading CSS, that is what respectable search engines do nowadays; that's how they determine whether a website is mobile friendly or not. Crawlers that do not respect it are not worth worrying about; major search engines have been reading CSS and JavaScript to crawl pages for about six years now, maybe more.
  • Σπύρος Γούλας
    Σπύρος Γούλας over 5 years
    Can you provide sources that back up that claim? Please see webmasters.stackexchange.com/questions/71546/… and yoast.com/dont-block-css-and-js-files and most importantly here webmasters.googleblog.com/2014/10/… where what you describe is portrayed as bad practice.
  • Rolando Retana
    Rolando Retana over 5 years
It would be a bad practice if I wanted Google to see my website normally and I blocked all of the CSS; it's bad practice because they interpret the CSS. But in this specific case I block one specific file, not all of the CSS. The OP asked about preventing Google from reading a section of the page, and I don't want Google to crawl those sections, so I block one single CSS file (not all of them, just one). And to back up the claim that crawlers read JS and CSS: it is as easy as going to your Google Webmaster Tools and taking a look at "Fetch as a robot"; you will see there how they read CSS and JS.
  • Rolando Retana
    Rolando Retana over 5 years
Also, to add: in my specific case it's not that I want to do something shady with the Google crawler; I just don't want Google to read a section of information that may seem repetitive on all pages, like phone numbers, addresses, related products, or information that is not relevant for Google to crawl.
  • Rolando Retana
    Rolando Retana over 5 years
  • Admin
    Admin over 5 years
    @StephenOstermiller I gladly gave you the bounty --- so Google never crawls data in iframe tags?
  • Stephen Ostermiller
    Stephen Ostermiller over 5 years
    It will crawl them unless you block that crawling using robots.txt
  • Maximillian Laumeister
    Maximillian Laumeister about 5 years
    @Cristol.GdM Correct. This answer is no longer relevant, as search engines like Google now execute JavaScript on the page as part of the indexation process.
  • Stephen Ostermiller
    Stephen Ostermiller over 3 years
    This looks useful. It should be noted that it doesn't prevent the text from getting indexed, it just prevents Google from showing it in the snippet in the search results.