Nginx location match regex for special characters and encoded URL characters


Solution 1

Your solution is terrible, let me tell you why.

Every single request which matches this location block now has to be evaluated against two if conditions before being served.

Any request that matches is then redirected to the correct URL, which also falls inside this location block, so the server evaluates both if conditions a second time.

Just for fun, you are also making Nginx evaluate requests for image, CSS and JS files against your if conditions. None of them will match, since the problem file is a PDF, but every one of those requests still pays for two extra regex evaluations.

A much more Nginx friendly solution is actually very simple.

Nginx tests regex locations in the order they are listed in your config and uses the first one that matches, so if this file's URL would match any of your other regex locations, you need to place this block above them:

location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
    return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
}

Just tested it on one of my servers running Nginx 1.15.1, works a charm.
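To sanity-check the pattern outside Nginx, here is a minimal Python sketch. It relies on two assumptions: Python's `re` with `re.IGNORECASE` behaves like Nginx's PCRE `~*` for this particular pattern, and the `/resources/weather-documents` prefix on the test paths (taken from the redirect target) is where the broken file lives. Nginx matches location regexes against the normalized, percent-decoded URI, so the decoded form is the one that matters; the raw form is included only for comparison.

```python
import re

# The pattern from the location block above; IGNORECASE stands in for "~*".
PATTERN = re.compile(
    r"/historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$",
    re.IGNORECASE,
)

# Hypothetical request paths (the prefix is an assumption, see the lead-in).
PREFIX = "/resources/weather-documents"
decoded = PREFIX + "/historical-rainfall-trends-south-africa-1921\u00e2\u20ac\u201c2015.pdf"
raw = PREFIX + "/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"
target = PREFIX + "/historical-rainfall-trends-south-africa_1921_2015.pdf"


def matches(path: str) -> bool:
    """Return True if the location regex would match this path."""
    return PATTERN.search(path) is not None


print(matches(decoded))  # broken name, percent-decoded
print(matches(raw))      # broken name, as sent on the wire
print(matches(target))   # the redirect target itself
```

The redirect target does not match because it has an underscore before 1921 where the pattern requires a hyphen, so the 301 cannot loop back into this block.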

Solution 2

I don't know Nginx or how it handles regex, but:

  • You could try to match for percent in the encoded URL with:

    %+

  • You could try to match for the special chars in the encoded URL with:

    (%([A-Z][0-9]|[0-9][A-Z]|[0-9]+|[A-Z]+))+

  • You could try to match for non-ASCII chars in the unencoded URL with:

    [^\x00-\x7F]+
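The patterns above can be tried against the URL from the question. This is a rough Python sketch, not something taken from the answer: `%[0-9A-Fa-f]{2}` is a stricter stand-in for the second pattern (exactly one percent sign followed by two hex digits), and the names `PERCENT_TRIPLET` and `NON_ASCII` are made up for illustration.

```python
import re
from urllib.parse import unquote

ENCODED = "/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"

# One percent-encoded byte: "%" followed by exactly two hex digits.
PERCENT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")
# One or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

# Every encoded byte in the raw URL.
triplets = PERCENT_TRIPLET.findall(ENCODED)
# Non-ASCII runs only appear once the URL has been percent-decoded.
non_ascii_runs = NON_ASCII.findall(unquote(ENCODED))

print(triplets)
print(non_ascii_runs)
```

The encoded URL is pure ASCII, so the non-ASCII pattern only finds something after decoding. Which form a given Nginx directive or variable sees decides which pattern is useful there.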


Solution 3

Temporary Solution

Thanks to @funilrys and to the question "How do I redirect all requests that contains a certain string to 404 in nginx?"

This now works 100%:

location /resources {
    expires 3h;
    add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';

    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 3h;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    }

    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';

        # Redirect any PDF request whose raw URI contains a percent sign
        if ($request_uri ~ .*%.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }

        # Redirect any PDF request whose raw URI contains a non-ASCII byte
        if ($request_uri ~ .*[^\x00-\x7F]+.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
    }
}
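To see which requests these two if conditions would catch, here is a small Python sketch of the same checks. `$request_uri` holds the original request URI as the client sent it, so percent signs are still present. The helper `would_redirect` and the `some%20report.pdf` path are hypothetical, used only for illustration.

```python
import re

# The two conditions from the config, applied to the raw request URI.
HAS_PERCENT = re.compile(r".*%.*")
HAS_NON_ASCII = re.compile(r".*[^\x00-\x7F]+.*")


def would_redirect(request_uri: str) -> bool:
    """Return True if either if condition would fire for this raw URI."""
    return bool(
        HAS_PERCENT.search(request_uri) or HAS_NON_ASCII.search(request_uri)
    )


broken = "/resources/weather-documents/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"
clean = "/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf"
other = "/resources/some%20report.pdf"  # hypothetical file with an encoded space

print(would_redirect(broken))
print(would_redirect(clean))
print(would_redirect(other))
```

The third case shows the cost of this approach: any PDF request under /resources whose URI contains a percent sign is sent to this one file, not just the broken one.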

Author: MitchellK

BY DAY: Professional photographer at https://mitchellkrog.com BY DAY & NIGHT: Linux Junkie, Solution Finder and Code Writer at https://github.com/mitchellkrogza and https://ubuntu101.co.za

Updated on June 04, 2022

Comments

  • MitchellK
    MitchellK almost 2 years

    I've been trying so many things today and I am just not winning. I have one file in my site which got created by accident with a special character in it. As a result Googlebot has stopped crawling for 3 weeks now and Webmaster tools / Search console keeps notifying me and wanting to retest the url.

    All I want to achieve is to configure Nginx to match the following requests and redirect them to the correct location but regex has me stumped on this one.

    The unencoded URL string is:

    /historical-rainfall-trends-south-africa-1921–2015.pdf

    The encoded URL string is:

    /historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf

    How can I get a location match for these?
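One way to see how the two strings relate is to percent-decode the encoded form. The Python sketch below does that and then tries one repair step. The repair rests on an assumption the post does not state: that the file name started as an en dash whose UTF-8 bytes were misread as Windows-1252 and saved again.

```python
from urllib.parse import unquote

encoded = "/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"

# Percent-decode, interpreting the bytes as UTF-8 (unquote's default).
decoded = unquote(encoded)

# Assumption: undo a UTF-8 -> Windows-1252 mix-up by reversing it once.
repaired = decoded.encode("cp1252").decode("utf-8")

print(ascii(decoded))   # escapes show the three characters between 1921 and 2015
print(ascii(repaired))  # escapes show a single character in the same place
```

Under that assumption, the path Nginx sees after decoding has three non-ASCII characters between 1921 and 2015 (U+00E2, U+20AC, U+201C), not a single en dash (U+2013), so a pattern written for a literal en dash would not match it.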

    UPDATE:

    Still losing my mind, nothing I have tried is working. I get a match with this regex here - https://regex101.com/r/3Lk2zr/3

    but then using this

    location ~ /.*[^\x00-\x7F]+.* { return 444; }

    still gives me a 404 and not a 444

    Likewise I get a match with this - https://regex101.com/r/80KWJ8/1 But then

    location ~ /.*([^?]*)\%(.*)$ { return 444; }

    Gives 404 and not 444 😭

    I also tried these, but they still don't work. Sourced from: https://serverfault.com/questions/656096/rewriting-ascii-percent-encoded-locations-to-their-utf-8-encoded-equivalent

    location ~* (*UTF8).*([^?]*)\%(.*)$ { return 444; }

    location ~* (*UTF8).*[^\x00-\x7F]+.* { return 444; }


  • MitchellK
    MitchellK over 5 years
    Thank you funilrys I'll give these a whirl in the morning and report back
  • MitchellK
    MitchellK over 5 years
    Nginx uses PCRE Regex
  • MitchellK
    MitchellK over 5 years
    Thank you miknik, this is much, much cleaner and simpler. I was aware my solution was not the greatest, but it worked for the time being. I fully understand the order of locations in Nginx; I just totally suck at regex and could not quite nail this one, but you did, so I will put yours into action today and come back to you. Many, many thanks.
  • MitchellK
    MitchellK over 5 years
    Morning @miknik, I tried your solution this morning and I am not succeeding. I placed this location block in various places to try to catch why it's not working, but still no luck. I will retest this later today and get back to you. Something is out of order, but I just can't spot it right now, and it must be something staring me in the face.
  • MitchellK
    MitchellK over 5 years
    This one is driving me nuts. I even moved this to the highest possible spot in the location ordering and it just won't work 😭 It just gives me 404s and not the redirect. I will have to test this tomorrow on a new test site and then slowly trace what's wrong and where.
  • miknik
    miknik over 5 years
    Is your return URL within the directive 100% correct? You sure it's not returning 301 to the wrong location? What do your logs say? Have you set a root directive at the server level?
  • MitchellK
    MitchellK over 5 years
    Hi @miknik, yes, the 301 is 100% correct. I've tried placing your location statement everywhere and just keep getting a 404, but the if statements (now modified blocks, temp solution updated) still work 100%. I cannot fathom this.