Passing User-Agent through CloudFront for Facebook scraper


Solution 1

If you need to know the user-agent for one case, there's little you can do other than whitelisting the User-Agent: header in the relevant CloudFront behavior.

CloudFront caches responses against the request headers it forwards. The net result is that a cached response obtained by forwarding a request with User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 will not be considered usable by CloudFront for serving a later request with User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36, even though, for all practical purposes, it's the same browser.
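To make the fragmentation concrete, here is a minimal sketch of a toy cache key. It is not CloudFront's actual key algorithm; the cache_key function, the /page path, and the hashing scheme are assumptions made purely for illustration. It only shows that once a header is whitelisted, every distinct value of that header produces a separate cache entry.

```python
import hashlib


def cache_key(path, request_headers, whitelisted_headers):
    """Return a simplified cache key: the path plus each whitelisted header value."""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    parts = [path]
    for name in sorted(h.lower() for h in whitelisted_headers):
        parts.append(f"{name}={lowered.get(name, '')}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


# The two User-Agent strings quoted above.
CHROME_39 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}
CHROME_40 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36"
}

# With User-Agent whitelisted, the two browsers map to different cache entries.
fragmented = (cache_key("/page", CHROME_39, ["User-Agent"])
              != cache_key("/page", CHROME_40, ["User-Agent"]))

# With no headers whitelisted, both requests share a single entry.
shared = cache_key("/page", CHROME_39, []) == cache_key("/page", CHROME_40, [])

print(fragmented, shared)
```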

When you configure CloudFront to cache based on one or more headers and the headers have more than one possible value, CloudFront forwards more requests to your origin server for the same object. This slows performance and increases the load on your origin server. If your origin server returns the same object regardless of the value of a given header, we recommend that you don't configure CloudFront to cache based on that header.

http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/header-caching.html

The phrase "configure CloudFront to cache based on one or more headers" is synonymous with "whitelisting headers" to be forwarded to the origin.

This is the reason to forward only as much as you need. Doing otherwise hurts your cache hit ratio, in this case because of the variation in User-Agent: strings: you don't get the full benefit of the edge caches, the origin server processes more requests, and more bandwidth is used between the origin and CloudFront. But there isn't really an alternative. CloudFront doesn't charge anything for storage in the edge caches, so the only cost difference will come from those other factors.

The phrase "slows performance" (above) doesn't mean CloudFront gets slower -- it only refers to the reduced likelihood of any particular request being a cache hit, because of the variation in possible header values.

Incidentally, the behavior of CloudFront is correct, in this regard, since a varying User-Agent: can mean a varied response, as indeed you've indicated it does.

Judicious use of path patterns and multiple cache behaviors is the key to getting the best possible use of the CDN cache, given the circumstance you've described. Whitelist User-Agent: only on path patterns that need it, such as /images/* (which is, of course, a path I just made up). The same advice applies to cookies and query strings. For path patterns where you don't need them, don't enable cookie or query string forwarding. Otherwise, cached responses will only be served to users who present the same cookie, or to requests whose path and query string match a cached response, so there would not be many cache hits.
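As a rough sketch of the path-pattern idea, the snippet below models an ordered list of cache behaviors and picks the first one whose pattern matches, with a catch-all default last. The BEHAVIORS list, its field names, and select_behavior are hypothetical and are not CloudFront's configuration format. Python's fnmatchcase is used only as an approximation of CloudFront's * and ? wildcards (it also accepts [seq] character sets).

```python
from fnmatch import fnmatchcase

# Hypothetical behaviors, checked in order. The final "*" entry plays the
# role of the default behavior and never forwards User-Agent.
BEHAVIORS = [
    {"path_pattern": "/images/*", "forward_user_agent": True},
    {"path_pattern": "*", "forward_user_agent": False},
]


def select_behavior(path):
    """Return the first behavior whose path pattern matches the request path."""
    for behavior in BEHAVIORS:
        if fnmatchcase(path, behavior["path_pattern"]):
            return behavior
    raise LookupError(f"no behavior matches {path!r}")


for path in ("/images/logo.png", "/index.html"):
    chosen = select_behavior(path)
    print(path, "->", chosen["path_pattern"], chosen["forward_user_agent"])
```

Only requests under /images/ pay the cache-fragmentation cost in this model; everything else keeps a single cached copy per path.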

Solution 2

It is now possible to use an origin request policy.

https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/controlling-origin-requests.html

This allows us to forward headers to the origin without including them in the cache key. While a custom policy can be fine-tuned as needed, there is a predefined origin request policy called Managed-UserAgentRefererHeaders, which makes the viewer's User-Agent header visible at the origin (or at an origin request Lambda@Edge function, for that matter).

https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/using-managed-origin-request-policies.html
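With such a policy attached, an origin request Lambda@Edge function could inspect the forwarded header along these lines. This is a minimal sketch, not a definitive implementation: the helper names and the X-Facebook-Scraper marker header are hypothetical, and matching on the facebookexternalhit substring (taken from the User-Agent quoted in the question) is an assumption about how you would want to detect the scraper.

```python
FACEBOOK_UA_TOKEN = "facebookexternalhit"


def get_user_agent(request):
    """Read User-Agent from a CloudFront request object, or '' if absent."""
    # Lambda@Edge keys headers by lowercase name; each value is a list of
    # {"key": ..., "value": ...} dictionaries.
    values = request.get("headers", {}).get("user-agent", [])
    return values[0]["value"] if values else ""


def is_facebook_scraper(user_agent):
    """Return True if the User-Agent looks like Facebook's link scraper."""
    return FACEBOOK_UA_TOKEN in user_agent.lower()


def lambda_handler(event, context):
    """Origin request handler: tag scraper requests before they reach the origin."""
    request = event["Records"][0]["cf"]["request"]
    if is_facebook_scraper(get_user_agent(request)):
        # Hypothetical marker header that the origin could check.
        request["headers"]["x-facebook-scraper"] = [
            {"key": "X-Facebook-Scraper", "value": "1"}
        ]
    return request
```

Because the header is not part of the cache key here, a response the origin customizes for the scraper can be cached and served to other viewers of the same URL, so the origin would likely need to mark such responses as non-cacheable.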



Author: Amir Zucker

Updated on September 18, 2022

Comments

  • Amir Zucker
    Amir Zucker over 1 year

    This question is borderline Stack Overflow/Server Fault, so please don't hold it against me that it's here :)

    I have a service hosted on AWS: nginx with Node.js behind it. I have a CloudFront distribution set up to serve requests, with the service as the origin (to be able to grow without adding application servers).

    Amazon suggests filtering most headers from forwarded requests when setting up CloudFront distributions, specifically User-Agent, which they claim can vary dramatically, thus reducing the effectiveness of the CDN setup.

    This works great for most cases, except when trying to share pages on Facebook, in which case I need to know that the user agent is actually Facebook (facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)) to be able to return a custom response.

    I would create a special path for Facebook shares in order to use a custom CloudFront behavior for such cases, but unfortunately I can't control what the users will do, so the shared URL may be the same as "regular" server URLs.

    Suggestions?