The Sitemap Paradox


Solution 1

Disclaimer: I work together with the Sitemaps team at Google, so I'm somewhat biased :-).

In addition to using Sitemaps extensively for "non-web-index" content (images, videos, News, etc.) we use information from URLs included in Sitemaps files for these main purposes:

  • Discovering new and updated content (I guess this is the obvious one, and yes, we do pick up and index otherwise unlinked URLs from there too)
  • Recognizing preferred URLs for canonicalization (there are other ways to handle canonicalization too)
  • Providing a useful indexed URL count in Google Webmaster Tools (approximations from site:-queries are not usable as a metric)
  • Providing a basis for useful crawl errors (if a URL included in a Sitemap file has a crawl error, that's usually a bigger issue & shown separately in Webmaster Tools)

On the webmaster-side, I've also found Sitemaps files extremely useful:

  • If you use a crawler to create the Sitemaps file, then you can easily check that your site is crawlable and see first-hand what kind of URLs are found. Is the crawler finding your preferred URLs, or is something incorrectly configured? Is the crawler getting stuck in infinite spaces (eg endless calendar scripts) somewhere? Is your server able to handle the load?
  • How many pages does your site really have? If your Sitemap file is "clean" (no duplicates, etc), then that's easy to check.
  • Is your site really cleanly crawlable without running into duplicate content? Compare the server logs left behind by Googlebot with your Sitemaps file (see the sketch after this list) -- if Googlebot is crawling URLs that aren't in your Sitemap file, you might want to double-check your internal linking.
  • Is your server running into problems with your preferred URLs? Cross-checking your server error log with the Sitemaps URLs can be quite useful.
  • How many of your pages are really indexed? As mentioned above, this count is visible in Webmaster Tools.
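
As a rough illustration of the log cross-check referenced in the list above, here is a minimal Python sketch. The file names, the combined-log format, and the plain "Googlebot" substring match are assumptions for illustration, not anything prescribed in the answer.

    # Minimal sketch: compare the URLs Googlebot actually crawled (from an access
    # log) with the URLs listed in a plain sitemap. File names, log format and the
    # naive user-agent check are assumptions.
    import re
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_paths(path="sitemap.xml"):
        """Return the set of URL paths listed in a (non-index) sitemap file."""
        tree = ET.parse(path)
        return {urlparse(loc.text.strip()).path
                for loc in tree.iter(SITEMAP_NS + "loc")}

    def googlebot_paths(path="access.log"):
        """Return the set of paths requested by Googlebot in a combined-format log."""
        request = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')
        hits = set()
        with open(path) as log:
            for line in log:
                if "Googlebot" not in line:  # naive check; proper verification uses reverse DNS
                    continue
                match = request.search(line)
                if match:
                    hits.add(urlparse(match.group(1)).path)
        return hits

    if __name__ == "__main__":
        listed, crawled = sitemap_paths(), googlebot_paths()
        # Crawled but never listed: possible duplicate content or stray internal links.
        for p in sorted(crawled - listed):
            print("crawled but not in sitemap:", p)
        # Listed but never crawled: possible crawlability gaps for your preferred URLs.
        for p in sorted(listed - crawled):
            print("in sitemap but not crawled:", p)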

Granted, for really small, static, easily crawlable sites, using Sitemaps may be unnecessary from Google's point of view once the site has been crawled and indexed. For anything else, I'd really recommend using them.

FWIW There are some misconceptions that I'd like to cover as well:

  • The Sitemap file isn't meant to "fix" crawlability issues. If your site can't be crawled, fix that first.
  • We don't use Sitemap files for ranking.
  • Using a Sitemap file won't reduce our normal crawling of your site. It's additional information, not a replacement for crawling. Similarly, not having a URL in a Sitemap file doesn't mean that it won't be indexed.
  • Don't fuss over the meta-data. If you can't provide useful values (eg for priority), leave them out & don't worry about that (a minimal example follows this list).
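
To make that last point concrete: the protocol only requires <loc> for each URL; lastmod, changefreq and priority are optional. Here is a minimal Python sketch that simply leaves the optional metadata out (the URLs are placeholders, not anything from the answer above).

    # Sketch: write a minimal sitemap containing only <loc> for each URL,
    # omitting optional metadata (lastmod, changefreq, priority) rather than
    # inventing default values. The example URLs are placeholders.
    import xml.etree.ElementTree as ET

    def write_minimal_sitemap(urls, out_path="sitemap.xml"):
        urlset = ET.Element("urlset",
                            xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        for url in urls:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        ET.ElementTree(urlset).write(out_path, encoding="utf-8",
                                     xml_declaration=True)

    write_minimal_sitemap([
        "https://www.example.com/",
        "https://www.example.com/questions/12345/some-question",
    ])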

Solution 2

If you know you have good site architecture and that Google will find your pages naturally, the only benefit I'm aware of is faster indexing. If your site is getting indexed fast enough for you, then there's no need.

Here's an article from 2009 where a gentleman tested how fast Google crawled his site with a sitemap and without: http://www.seomoz.org/blog/do-sitemaps-effect-crawlers

My rule of thumb: if you're launching something new and untested, don't submit a sitemap -- you want to see how Google crawls your site on its own, to make sure there is nothing that needs to be fixed. However, if you're making changes and want Google to see them faster, then do submit. And if you have time-sensitive content such as breaking news, submit as well, because you want to do whatever you can to make sure you're the first one Google sees. Otherwise it's a matter of preference.

Solution 3

I suspect that, for Google, sitemaps are necessary to keep track of updates in the fastest way possible. Say you have added new content to some deep location of your web site, which takes more than 10-20 clicks to reach from your home page. Google is unlikely to reach this new page in a short time -- so, until a path to this page is fully established, its existence is announced via the sitemap. After all, PageRank is not calculated immediately; it takes time to evaluate user behavior and such -- so, until then, why shouldn't the engine crawl and index a page with fresh content?

Solution 4

Sitemaps are incredibly valuable if you use them correctly.

First off, the fact that Google says they are hints is only there to (a) ensure that webmasters aren't under the false impression that sitemap = indexation, and (b) give Google the ability to ignore certain sitemaps if they deem them unreliable (e.g. lastmod set to the current date for all URLs every day they're accessed).

However, Google generally likes and consumes sitemaps (in fact they'll sometimes find their own and add them to Google Webmaster Tools). Why? It increases the efficiency with which they can crawl.

Instead of starting at a seed site and crawling the web, they can allocate an appropriate amount of their crawl budget to a site based on the submitted sitemaps. They can also build up a large history of your site with associated error data (500, 404 etc.)

From Google:

"Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it."

What they don't say is that crawling the web is time-consuming and they'd prefer to have a cheat sheet (aka a sitemap).

Sure, your site might be just fine from a crawl perspective, but if you want to introduce new content, dropping that content into a sitemap with a high priority is a quicker way to get it crawled and indexed.

And this works for Google too, since they want to find, crawl, and index new content -- fast. Now, even if you don't think Google prefers the beaten path over the machete-through-the-jungle approach, there's another reason why sitemaps are valuable: tracking.

In particular, using a sitemap index (http://sitemaps.org/protocol.php#index) you can break your site down into sections - sitemap by sitemap. By doing so you can then look at the indexation rate of your site section by section.

One section or content type might have an 87% indexation rate while another could have a 46% indexation rate. It's then your job to figure out why.
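
As a sketch of what that sectioned setup can look like: one sitemap file per site section, referenced from a sitemap index, so each section's indexed count can be read separately in the webmaster consoles. The section names and URLs below are hypothetical.

    # Sketch: split URLs into per-section sitemaps plus a sitemap index so that
    # indexation can be compared section by section. Sections/URLs are hypothetical.
    import xml.etree.ElementTree as ET

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    SECTIONS = {
        "questions": ["https://www.example.com/questions/1",
                      "https://www.example.com/questions/2"],
        "tags":      ["https://www.example.com/tags/python"],
    }

    def write_urlset(urls, path):
        urlset = ET.Element("urlset", xmlns=NS)
        for url in urls:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

    def write_index(sitemap_urls, path="sitemap_index.xml"):
        index = ET.Element("sitemapindex", xmlns=NS)
        for sm in sitemap_urls:
            ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = sm
        ET.ElementTree(index).write(path, encoding="utf-8", xml_declaration=True)

    children = []
    for section, urls in SECTIONS.items():
        write_urlset(urls, "sitemap-" + section + ".xml")
        children.append("https://www.example.com/sitemap-" + section + ".xml")
    write_index(children)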

To get full use out of sitemaps you'll want to track Googlebot (and Bingbot) crawl on your site (via weblogs), match those to your sitemaps and then follow it all through to traffic.
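
Here is a rough sketch of the log-matching half of that, assuming a combined-format access log and the same hypothetical section prefixes as above. Matching crawlers by a user-agent substring is a simplification (proper verification uses a reverse DNS lookup).

    # Sketch: bucket Googlebot/Bingbot hits from an access log into the same site
    # sections used for the per-section sitemaps, to see where crawl activity goes.
    # Section prefixes and the log file name are assumptions for illustration.
    import re
    from collections import Counter

    SECTION_PREFIXES = {
        "questions": "/questions/",
        "tags": "/tags/",
    }

    request = re.compile(r'"(?:GET|HEAD) (\S+) HTTP')
    hits = Counter()

    with open("access.log") as log:
        for line in log:
            if "Googlebot" not in line and "bingbot" not in line:
                continue
            match = request.search(line)
            if not match:
                continue
            path = match.group(1)
            section = next((name for name, prefix in SECTION_PREFIXES.items()
                            if path.startswith(prefix)), "other")
            hits[section] += 1

    for section, count in hits.most_common():
        print(section, count, "crawler hits")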

Don't go to sleep on sitemaps - invest in them.

Solution 5

In Google's words: "In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it."

But I agree that the best thing you can do if you want your website pages to appear in search engines is to make sure they are crawlable from the site proper.



Comments

  • McDowell
    McDowell over 1 year

    We use a sitemap on Stack Overflow, but I have mixed feelings about it.

    Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

    Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap:

    1. Sitemaps are intended for sites that are hard to crawl properly.
    2. If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap, it gives the sitemap link no weight and will not index it!

    That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you!

    Google goes out of their way to make no sitemap guarantees:

    "We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index" citation

    "We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap." citation

    "submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results" citation

    Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees.

    By the time you have done that, and are getting spidered nice and thoroughly so Google can see that your own site links to these pages, and would be willing to crawl the links -- uh, why do we need a sitemap, again? The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience.

    That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions.

    Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?

    • Admin
      Admin over 13 years
      I thought sitemaps were more of a simpler tool for a simpler age... I kind of figured the only reason to provide a sitemap nowadays was for human assistance in navigating the site, albeit technically inclined humans. I don't see a problem with "if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you!" but it may just be me.
    • Admin
      Admin over 13 years
      While I know that Google generates the bulk of your traffic, I think it's important to understand how other spiders seem to use the sitemap.
    • Admin
      Admin over 13 years
      @mikej google is not "the bulk" of our traffic, it is 99.6% of all search traffic and 87% of total traffic
    • Admin
      Admin over 13 years
      Jeff, always love your posts... I only use XML sitemaps for pages that Google might not otherwise find. But really I have been disappointed with them and Google Webmaster Tools. I honestly think Google does a good enough job indexing available content on a site. No point for a sitemap. Now a sitemap for user navigation is a smart idea. I like the Web 2.0 footer site maps and try to incorporate them in almost any design they are appropriate for...
    • Admin
      Admin over 13 years
      @Frank ... that's what I'm trying to say!!! ;)
    • Admin
      Admin over 13 years
      @Jeff Atwood: Which sitemap links are you referring to about not being picked up?
    • Admin
      Admin over 13 years
      @Jeff Atwood: John Mueller said "we do pick up and index otherwise unlinked URLs" -- does this solve the paradox?
  • John Conde
    John Conde over 13 years
    Sites like stackoverflow get crawled so frequently I'm willing to bet it's faster than using a sitemap.
  • McDowell
    McDowell over 13 years
    There is still a mental cost, as in the perceived "safety" of having links that are guaranteed to be indexed regardless of your website's crawl status... which isn't true in our experience.
  • McDowell
    McDowell over 13 years
    @john still, this is the only rational explanation I can think of for what a sitemap could actually do for you. "It can't hurt" is a common refrain, but the mere existence of a sitemap is harmful (additional complexity, etc.), so if it isn't helping, it's still a net negative and it's gotta go.
  • John Conde
    John Conde over 13 years
    @Jeff I wasn't disagreeing. I was just saying SO didn't fit that mold.
  • McDowell
    McDowell over 13 years
    this is kind of confirmed through the link Joshak provided: seomoz.org/blog/do-sitemaps-effect-crawlers
  • DisgruntledGoat
    DisgruntledGoat over 13 years
    I believe the <priority> field is fairly important, to let them know which pages are the most vital. For example on Stack Overflow you have hundreds of tag and user pages which are fine, but nowhere near as important as the questions themselves. If the sitemap sets the question priority to 1 and everything else lower, the questions are more likely to be indexed over other pages.
  • jcolebrand
    jcolebrand over 13 years
    @Jeff Atwood "@John still,..." that's the point that I was attempting to make. It was beneficial at first, but now you don't need it. So why do you persist in trying to have it?
  • Admin
    Admin over 13 years
    are these pages also linked outside your site?
  • Virtuosi Media
    Virtuosi Media over 13 years
    Wouldn't an RSS feed accomplish the same thing?
  • Joshak
    Joshak over 13 years
    There are certainly a lot of things you can do with RSS feeds to improve indexing; however, the data in the article I linked above would suggest that a sitemap is more effective than just an RSS feed.
  • DisgruntledGoat
    DisgruntledGoat over 13 years
    You do use Sitemaps for "self-ranking", right? I mean in ranking the content across one site. Otherwise why the priority field?
  • John Mueller
    John Mueller over 13 years
    The "priority" element is a fairly small signal for us, that we might use if we're very limited with crawling on your site (we don't use it for ranking purposes). For most sites, that's not going to be an issue, so it's fine if you can easily provide useful values, but not something to lose sleep over if you can't. If you can't provide useful values for this and other meta-data elements, then just leave the elements out altogether (don't use "default" values).
  • Stephan Muller
    Stephan Muller over 13 years
    Thanks for this very informative answer. I'm going to stop updating my sitemap and just use the RSS feed as a sitemap from now on.
  • Vilx-
    Vilx- over 11 years
    Is having information 100 levels deep an "issue of crawlability"? For example, if I have a webstore, and there is a long list of products in a category (say, 3000 products). The list is paged and has 200 pages. Naturally, I won't show all of the links. More like 1 2 3 ... 22 [23] 24 ... 198 199 200. So, to find a product on page 100, you'd need to go through about 100 links. Or use the search bar. Would googlebot crawl that, or would it give up after some 20 or so levels? Would a sitemap be the appropriate solution here?
  • Simon Hayter
    Simon Hayter over 11 years
    I disagree; sitemaps once had a purpose, but now they are obsolete, in my honest opinion. If your site is crawlable, Google will find those links; RSS and social media are great ways to get Google to find and index pages even faster.
  • Martijn
    Martijn over 9 years
    One of my sites uses increments of 20 and works via /from=60; it went all the way to 24000 (which was a mistake I had to fix -- there aren't that many pages), so I think it's safe to assume they do.
  • Mooing Duck
    Mooing Duck over 4 years
    @Martijn: Next buttons can be infinite, and Google's indexer has better things to do. It'll follow them for a while. I assume there's some sort of priority for sites: sites with lots of incoming links it'll crawl for longer, and random pages with no incoming links it'll probably crawl only for a while before moving on. I assume.
  • cbdeveloper
    cbdeveloper over 4 years
    What if I add some structured data tags to my blog post page template? It will not change any visible content in any blog post, but all of them will start to render structured data article tags, for example. Am I supposed to update the <lastmod> for all of my blog posts' URLs because of that change? It's confusing, because to the user nothing has changed, but the crawler will see new tags that are important for indexing. What should I do in this case?