Google doesn't crawl CDN files


Solution 1

So, the explanation seems to be that Amazon CloudFront also evaluates my robots.txt and somehow applies different syntax rules than Google does.

The working version of my robots.txt is the following:

User-agent: Googlebot-Image
Disallow: /
User-agent: *
Disallow: /homepage
Disallow: /uncategorized
Disallow: /page
Disallow: /category
Disallow: /author
Disallow: /feed
Disallow: /tags
Disallow: /test

An important note: this does not perform exactly the same functions as before. In fact, I removed all blank lines, wildcards and "Allow" directives, so the end result is not the same... but I think it is close enough for me. For example, it no longer excludes tag pages when the tag is passed in a query string...

Three important notes:

  1. If you're testing with this, don't forget to invalidate robots.txt in your CloudFront distribution for each iteration (see the invalidation sketch after this list). Just checking that you're being served the latest version is not enough.

  2. I couldn't find a definition anywhere of the robots.txt syntax understood by Amazon CloudFront, so it was trial and error.

  3. To test the results, use the "Fetch and Render" tool in Google Webmaster Tools and their Mobile-Friendly Test (https://www.google.com/webmasters/tools/mobile-friendly/).
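
For reference, the invalidation step can be done with the AWS CLI; a minimal sketch, assuming the standard aws cloudfront command and a placeholder distribution ID:

aws cloudfront create-invalidation \
    --distribution-id E1234EXAMPLEID \
    --paths "/robots.txt"

The invalidation takes a few minutes; re-test only once its status shows "Completed".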

I don't understand why CloudFront is validating and evaluating my robots.txt. This file is a "deal" between me and the crawlers that come to my site. Amazon has no business in the middle. Messing with my robots.txt is just plain stupid.

It never crossed my mind that CloudFront could be second-guessing my robots.txt syntax.

Solution 2

  1. Create a robots.txt in a bucket.

  2. Create another origin for your CloudFront distribution.

  3. Set your bucket's priority higher than your website's.

  4. Invalidate your site's robots.txt on CloudFront.

After doing the above, Google will read the site's robots.txt when crawling your site and will see the different robots.txt when following links from your CDN.
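
A minimal sketch of what this can look like with the AWS CLI, assuming placeholder names (the bucket name, distribution ID and the permissive robots.txt content are only examples; use whatever rules you actually want crawlers to see on the CDN hostname):

# robots.txt to be served only on the CDN hostname
User-agent: *
Allow: /

# upload it to the bucket that backs the extra origin
aws s3 cp robots.txt s3://example-cdn-robots-bucket/robots.txt

# then add a cache behavior with path pattern /robots.txt that points to that
# origin, with higher precedence than the default behavior, and invalidate:
aws cloudfront create-invalidation --distribution-id E1234EXAMPLEID --paths "/robots.txt"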

Solution 3

Google does not block external resources from being indexed based on a robots.txt in the root of the main site. A subdomain, a CDN or similar is classed as an external domain, therefore the only way to block the content is with a header response on the file served by the CDN itself, or with a robots.txt on the CDN or subdomain.

Using:

#Google images
User-agent: Googlebot-Image
Disallow: /

should only block images that are local; you will need to do the same on the CDN.

The chances are it's a header-response problem, so you should run curl against one of the files served by the CDN and check the response headers.
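
For example, with a HEAD request against one of the blocked files (the URL is a placeholder taken from the question's examples):

curl -I https://cdn5.example.com/wp-content/themes/magazine/images/nobg.png

The response should look something like: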

HTTP/1.0 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:51 GMT
ETag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
Age: 7
Content-Length: 0
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: V_da8LHRj269JyqkEO143FLpm8kS7xRh4Wa5acB6xa0Qz3rW3P7-Uw==,iFg6qa2KnhUTQ_xRjuhgUIhj8ubAiBrCs6TXJ_L66YJR583xXWAy-Q==
Via: 1.0 d2625240b33e8b85b3cbea9bb40abb10.cloudfront.net (CloudFront)
Connection: close

Things to look out for are:

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
X-Robots-Tag: googlebot: noindex

Solution 4

Found out the problem: CloudFront reads the robots.txt and prevents serving the content, but it apparently parses it somewhat differently from how a robot should, I guess.

For instance, take the following content in robots.txt:

Disallow: */wp-contents/
Allow: */wp-contents/themes/

When Googlebot reads it itself, it indexes the content; when CloudFront reads it, it doesn't consider the 'Allow' directive and refuses to serve anything inside */wp-contents/themes/.

Short answer: check the robots.txt on your CloudFront distribution; it might be the problem. Update it with a corrected version, invalidate it, and it should work!
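
As a side note, the comments below point out that the robots exclusion "standard" is not well standardized and that the leading wildcards are probably unnecessary for URLs like those in the question. A less ambiguous way to express the same pair of rules, path-anchored and using the directories from the example above, would be:

Disallow: /wp-contents/
Allow: /wp-contents/themes/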

Comments

  • tonelot
    tonelot over 1 year

    I've noticed that Google Webmaster Tools is reporting a lot of blocked resources in my website. Right now all the "blocked resources" are .css, .js and images (.jpg, .png) that I serve from Cloudfront CDN.

    I've spent a lot of time testing and trying to figure out why Google doesn't crawl these files and reports a "blocked resource" status.

    Currently I serve these files from several hostnames like: cdn1.example.com, cdn2.example.com, …

    cdn1, cdn2 and the others are CNAMEs pointing to the CloudFront distribution name.

    Test: I've tried using the CloudFront distribution directly (no CNAME) but the problem persists.

    Currently my robots.txt looks like this:

    # Google AdSense
    User-agent: Mediapartners-Google
    Disallow:
    
    #Google images
    User-agent: Googlebot-Image
    Disallow: /
    
    User-agent: *
    Disallow: /homepage
    Disallow: /index.php*
    Disallow: /uncategorized*
    Disallow: /tag/*
    Disallow: *feed
    Disallow: */page/*
    Disallow: *author*
    Disallow: *archive*
    Disallow: */category*
    Disallow: *tag=*
    Disallow: /test*
    Allow: /
    

    And examples of files blocked in one example page:

    • cdn1.example.com/wp-content/plugins/wp-forecast/wp-forecast-default.css

    • cdn9.example.com/wp-content/plugins/bwp-minify/min/?f=wp-content/themes/magazine/css/font-awesome.min.css,wp-content/themes/magazine/css/responsive.css

    • cdn5.example.com/wp-content/themes/magazine/images/nobg.png

    • cdn6.example.com/wp-content/plugins/floating-social-bar/images/fsb-sprite.png

    • cdn5.example.com/wp-content/uploads/2013/11/Design-Hotel-3-80x80.jpg

    • cdn5.example.com/wp-content/uploads/2013/11/Marta-Hotel-7-270x225.jpg

    I've even tried to allow everything in robots.txt but I always have the same result.

    I've also looked carefully at the CloudFront settings in Amazon and see nothing that could be related (I don't use, and have never used, the option "Restrict Viewer Access (Use Signed URLs or Signed Cookies)").

    Right now I've spent a lot of time looking into this and have no more ideas.

    Can someone think of a reason why Googlebot would be blocked from crawling files hosted on Amazon CloudFront?

  • tonelot
    tonelot about 9 years
    Hi, thanks for your answer. But my problem is not how to prevent images from being indexed. To avoid confusion I took that out of the robots.txt and the results are the same: Googlebot keeps complaining it's blocked on files I host on CloudFront and I don't know why. Any more ideas? Thanks for your attention, miguel
  • Michael - sqlbot
    Michael - sqlbot about 9 years
    Cloudfront neither "reads" robots.txt nor does any "considering" of its contents, nor any "preventing" of anything. Remember that what you get from cloudfront when you fetch an object tells you nothing about what someone served from another edge location would get, if their edge cached an earlier or later one than what yours did. Also, the leading wildcards are probably a recipe for unexpected behavior, since the robots exclusion "standard" is not well-standardized.
  • Simon Hayter
    Simon Hayter about 9 years
    You misunderstood; I know that you don't want it blocked... hence why I said at the bottom to ensure that your header response is NOT sending an X-Robots-Tag. Also, you say to check the robots.txt on your CloudFront distribution; I said this too! The only way to block images from being indexed on the CDN is an X-Robots-Tag and a robots.txt on the CDN itself, as already mentioned.
  • MrWhite
    MrWhite about 9 years
    "CloudFront reads the robots.txt" - Is this a robots.txt file hosted on CloudFront itself? The "leading wildcard" would also seem to be unnecessary, if the URLs are anything like those stated in the question.
  • tonelot
    tonelot about 9 years
    Hi. Definitely CloudFront is reading my robots.txt, and it is definitely not accepting the same syntax as Google. I had already tested taking out the robots.txt and saw no results because I didn't request an invalidation in CloudFront; I assumed it wasn't necessary because I was being served the latest version. Testing takes a long time because each change requires an invalidation request that takes forever to complete. I'll come back in a few hours with a working version. I don't know why this kind of "smartness" is needed... but it's there and I think it shouldn't be. miguel
  • CraftyScoundrel
    CraftyScoundrel about 9 years
    The same robots.txt present on my Apache is the one CloudFront got. I determined this empirically.