Stripping out select query string attribute/value pairs so Varnish will not vary the cache by them

Solution 1

I figured this out and wanted to share. I found the code below, which defines a subroutine that does what I need and is called from vcl_recv.

sub vcl_recv {

    # strip out certain querystring params that varnish should not vary cache by
    call normalize_req_url;

    # snip a bunch of other code
}

sub normalize_req_url {

    # Strip out Google Analytics campaign variables. They are only needed
    # by the javascript running on the page
    # utm_source, utm_medium, utm_campaign, gclid, ...
    if(req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=[%.-_A-z0-9]+&?", "");
    }
    set req.url = regsub(req.url, "(\?&?)$", "");
}

Solution 2

The regular expressions in Solution 1 have two problems.
I changed the patterns used in both the regsuball call and the final regsub call:

sub normalize_req_url {
    # Clean up root URL
    if (req.url ~ "^/(?:\?.*)?$") {
        set req.url = "/";
    }

    # Strip out Google Analytics campaign variables
    # They are only needed by the javascript running on the page
    # utm_source, utm_medium, utm_campaign, gclid, ...
    if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=[%\._A-z0-9-]+&?", "");
    }
    set req.url = regsub(req.url, "(\?&|\?|&)$", "");
}

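The first change is the character class "[%\._A-z0-9-]". In the original "[%.-_A-z0-9]", the dash between "." and "_" acted as a range operator covering every character from "." to "_", so a literal dash in a value was never matched. Moving the dash to the end of the class makes it a literal. I also escaped the dot, although inside a character class this is optional.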

The second change makes the final regsub strip not only a trailing question mark from the remaining URL, but also a trailing ampersand or a trailing "?&".

Solution 3

From https://github.com/mattiasgeniar/varnish-4.0-configuration-templates:

# Some generic URL manipulation, useful for all templates that follow
# First remove the Google Analytics added parameters, useless for our backend
if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
  set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
  set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
  set req.url = regsub(req.url, "\?&", "?");
  set req.url = regsub(req.url, "\?$", "");
}
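
The pieces from the answers above can be combined into one subroutine. In Solutions 1 and 2 the regsuball pattern is not anchored to "?" or "&", so a short name such as "ie" can also match inside an unrelated parameter like "movie=" once any tracked parameter is present. The following is a minimal sketch, not a definitive implementation: it anchors every name to a separator (as Solution 3 does), keeps the "utm_[a-z]+" wildcard from Solution 2, and matches values with "[^&]*" as in the question's own attempts. The backend address and the shortened parameter list are assumptions for illustration only.

```vcl
vcl 4.0;

# Hypothetical backend, present only so the file compiles on its own.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub normalize_req_url {
    # Only rewrite when a tracked name starts right after "?" or "&",
    # so "ie=" does not fire on "movie=".
    if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|utm_[a-z]+)=") {
        # 1. Drop every "&name=value" pair; the value runs up to the next "&".
        set req.url = regsuball(req.url, "&(gclid|cx|ie|cof|siteurl|utm_[a-z]+)=[^&]*", "");
        # 2. Drop a tracked pair that directly follows "?", keeping the "?".
        set req.url = regsub(req.url, "\?(gclid|cx|ie|cof|siteurl|utm_[a-z]+)=[^&]*", "?");
        # 3. Tidy up a leftover "?&" or a bare trailing "?".
        set req.url = regsub(req.url, "\?&", "?");
        set req.url = regsub(req.url, "\?$", "");
    }
}

sub vcl_recv {
    call normalize_req_url;
}
```

With this sketch, the first two URLs from the question should both normalize to /someproduct.html?type=hello, and the third to /someproduct.html?type=goodbye, so only the third is cached separately. Because req.url itself is rewritten, the backend also receives the URL without the stripped parameters.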
Author: runamok

Updated on June 05, 2022

Comments

  • runamok about 2 years

    My goal is to "whitelist" certain querystring attributes and their values so varnish will not vary cache between the urls.

    Example:

    Url 1: http://foo.com/someproduct.html?utm_code=google&type=hello  
    Url 2: http://foo.com/someproduct.html?utm_code=yahoo&type=hello  
    Url 3: http://foo.com/someproduct.html?utm_code=yahoo&type=goodbye
    

    In the above example I want to whitelist "utm_code" but not "type". So after the first URL is hit, I want Varnish to serve that cached content for the second URL.

    However, in the case of the third URL, the value of the "type" attribute is different, so that should be a Varnish cache miss.

    I have tried the two methods below (found in a Drupal help article I can't locate right now), but neither seemed to work, possibly because I have the regex wrong.

    # 1. strip out certain querystring values so varnish does not vary cache.
    set req.url = regsuball(req.url, "([\?|&])utm_(campaign|content|medium|source|term)=[^&\s]*&?", "\1");
    # get rid of trailing & or ?
    set req.url = regsuball(req.url, "[\?|&]+$", "");
    
    # 2. strip out certain querystring values so varnish does not vary cache.
    set req.url = regsuball(req.url, "([\?|&])utm_campaign=[^&\s]*&?", "\1");
    set req.url = regsuball(req.url, "([\?|&])foo_bar=[^&\s]*&?", "\1");
    set req.url = regsuball(req.url, "([\?|&])bar_baz=[^&\s]*&?", "\1");
    # get rid of trailing & or ?
    set req.url = regsuball(req.url, "[\?|&]+$", "");