Sanitizing strings to make them URL and filename safe?

Solution 1

Some observations on your solution:

  1. 'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
  2. \w matches the underscore character. You specifically include it for files, which suggests that you don't want underscores in URLs, but in the code you have, URLs will be permitted to include an underscore.
  3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
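To illustrate the second observation, here is a minimal sketch showing that \w accepts the underscore, and one way to exclude it. The helper name url_safe_chars is hypothetical and not part of the original question.

```php
<?php
// \w matches letters, digits AND the underscore.
var_dump(preg_match('/^\w$/', '_'));     // underscore counts as a "word" character
var_dump(preg_match('/^[^\W_]$/', '_')); // [^\W_] = word characters minus underscore

// Hypothetical helper: replace every run of non-word characters
// (and underscores) with a single dash.
function url_safe_chars(string $text): string
{
    return preg_replace('/[\W_]+/', '-', $text);
}
```

With this pattern, url_safe_chars('foo_bar baz') gives 'foo-bar-baz', so underscores no longer slip through into URLs.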

Creating the slug

You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.

So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
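The steps described above could be sketched roughly as follows. The function name slugify and the tiny character map are assumptions for illustration only (a real map would be much larger, like the one in the linked implementation), and digits are kept alongside [a-z], which is also an assumption.

```php
<?php
// Sketch: map accented characters to ASCII, lowercase, then replace every
// run of characters outside [a-z0-9] with a single dash.
function slugify(string $text): string
{
    // Illustrative, deliberately incomplete map (assumption, not exhaustive).
    $map = array(
        'é' => 'e', 'É' => 'E', 'è' => 'e', 'ü' => 'u',
        'Ü' => 'U', 'ñ' => 'n', 'ç' => 'c', 'ß' => 'ss',
    );
    $text = strtr($text, $map);

    // Safe to use strtolower() here: the characters we keep are ASCII.
    $text = strtolower($text);

    // The '+' collapses consecutive unwanted characters into one dash.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    // Drop dashes left over at either end.
    return trim($text, '-');
}

echo slugify('Frédéric & Éric'); // frederic-eric
```

Anything not covered by the map is simply replaced by a dash, so unmapped scripts lose their content rather than producing percent-encoded URLs.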

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API

Solution 2

I found this larger function in the Chyrp code:

/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "—", "–", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

and this one in the WordPress code:

/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
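The WordPress function relies on apply_filters(), so it will not run outside WordPress. A standalone sketch of the same steps might look like this; the function name sanitize_file_name_standalone and the $extra_chars parameter are hypothetical stand-ins for the filter hooks.

```php
<?php
// Standalone sketch of the WordPress logic above, without apply_filters().
// $extra_chars (hypothetical) lets the caller strip additional characters.
function sanitize_file_name_standalone(string $filename, array $extra_chars = array()): string
{
    $special_chars = array_merge(
        array('?', '[', ']', '/', '\\', '=', '<', '>', ':', ';', ',', "'", '"',
              '&', '$', '#', '*', '(', ')', '|', '~', '`', '!', '{', '}'),
        $extra_chars
    );

    // Remove the blacklisted characters outright.
    $filename = str_replace($special_chars, '', $filename);

    // Collapse runs of whitespace and dashes into a single dash.
    $filename = preg_replace('/[\s-]+/', '-', $filename);

    // Trim periods, dashes and underscores from both ends.
    return trim($filename, '.-_');
}

echo sanitize_file_name_standalone('my  file?.txt'); // my-file.txt
```

Like the original, this is a blacklist: anything not listed (control characters, non-ASCII text) passes through unchanged.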

Update Sept 2012

Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.

Solution 3

This should make your filenames safe...

$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);

and a deeper solution to this is:

// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);

This assumes that you want a dot in the filename. If you want it converted to lowercase, just use

$clean_name = strtolower($clean_name);

for the last line.

Solution 4

Try this:

function normal_chars($string)
{
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

    return trim($string, ' -');
}

Examples:

echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

Based on the selected answer in this thread: URL Friendly Username in PHP?

Solution 5

This isn't exactly an answer as it doesn't provide any solutions (yet!), but it's too big to fit on a comment...


I did some testing (regarding file names) on Windows 7 and Ubuntu 12.04 and what I found out was that:

1. PHP Can't Handle non-ASCII Filenames

Although both Windows and Ubuntu can handle Unicode filenames (even RTL ones as it seems) PHP 5.3 requires hacks to deal even with the plain old ISO-8859-1, so it's better to keep it ASCII only for safety.

2. The Length of the Filename Matters (Especially on Windows)

On Ubuntu, the maximum length a filename can have (including extension) is 255 (excluding path):

/var/www/uploads/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345/

However, on Windows 7 (NTFS) the maximum length a filename can have depends on its absolute path:

(0 + 0 + 244 + 11 chars) C:\1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234\1234567.txt
(0 + 3 + 240 + 11 chars) C:\123\123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\1234567.txt
(3 + 3 + 236 + 11 chars) C:\123\456\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456\1234567.txt

Wikipedia says that:

NTFS allows each path component (directory or filename) to be 255 characters long.

To the best of my knowledge (and testing), this is wrong.

In total (counting slashes) all these examples have 259 chars; if you strip the C:\ that gives 256 characters (not 255?!). The directories were created using the Explorer and you'll notice that it restrains itself from using all the available space for the directory name. The reason for this is to allow the creation of files using the 8.3 file naming convention. The same thing happens for other partitions.

Files don't need to reserve the 8.3 length requirements of course:

(255 chars) E:\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901.txt

You can't create any more sub-directories if the absolute path of the parent directory has more than 242 characters, because 256 = 242 + 1 + \ + 8 + . + 3. Using Windows Explorer, you can't create another directory if the parent directory has more than 233 characters (depending on the system locale), because 256 = 233 + 10 + \ + 8 + . + 3; the 10 here is the length of the string New folder.

The Windows file system poses a nasty problem if you want to ensure interoperability between file systems.

3. Beware of Reserved Characters and Keywords

Aside from removing non-ASCII, non-printable and control characters, you also need to re(place/move):

"*/:<>?\|

Just removing these characters might not be the best idea because the filename might lose some of its meaning. I think that, at the very least, multiple occurrences of these characters should be replaced by a single underscore (_), or perhaps something more representative (this is just an idea):

  • "*? -> _
  • /\| -> -
  • : -> [ ]-[ ]
  • < -> (
  • > -> )

There are also special keywords that should be avoided (like NUL), although I'm not sure how to overcome that. Perhaps a black list with a random name fallback would be a good approach to solve it.
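As a rough sketch of the idea above: the function name replace_reserved is hypothetical, the character mapping follows the list in this section, and the blacklist assumes the usual Windows device names (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9). It needs PHP 7+ for random_bytes().

```php
<?php
// Sketch: replace reserved characters with look-alikes and fall back to a
// random base name when the result is a reserved Windows device name.
function replace_reserved(string $name): string
{
    // Drop control characters first.
    $name = preg_replace('/[\x00-\x1F\x7F]/', '', $name);

    // Mapping suggested in the list above.
    $name = strtr($name, array(
        '"' => '_', '*' => '_', '?' => '_',
        '/' => '-', '\\' => '-', '|' => '-',
        ':' => ' - ', '<' => '(', '>' => ')',
    ));

    // Collapse runs of underscores into a single one.
    $name = preg_replace('/_+/', '_', $name);

    // Names such as NUL.txt are reserved as well, so only the part
    // before the first dot is compared against the blacklist.
    $dot  = strpos($name, '.');
    $base = ($dot === false) ? $name : substr($name, 0, $dot);
    $ext  = ($dot === false) ? '' : substr($name, $dot);

    if (preg_match('/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i', $base)) {
        $base = bin2hex(random_bytes(8)); // random 16-character fallback
    }

    return $base . $ext;
}

echo replace_reserved('a<b>:c?*.txt'); // a(b) - c_.txt
```

The random fallback loses the original name, so storing the user's original filename separately is probably still worthwhile.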

4. Case Sensitiveness

This should go without saying, but if you want to ensure file uniqueness across different operating systems you should transform file names to a normalized case; that way my_file.txt and My_File.txt on Linux won't both become the same my_file.txt file on Windows.

5. Make Sure It's Unique

If the file name already exists, a unique identifier should be appended to its base file name.

Common unique identifiers include the UNIX timestamp, a digest of the file contents or a random string.
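Points 4 and 5 can be combined in a short sketch. The function name unique_filename is hypothetical; it lowercases the name and appends a random string (one of the identifiers mentioned above) until no file with that name exists in the target directory. It assumes ASCII names and PHP 7+.

```php
<?php
// Sketch: normalize case, then append a random suffix to the base name
// while a file with the candidate name already exists in $dir.
function unique_filename(string $dir, string $name): string
{
    $name = strtolower($name); // normalized case (ASCII names assumed)

    // Split into base name and extension; a leading dot is not an extension.
    $dot  = strrpos($name, '.');
    $base = ($dot === false || $dot === 0) ? $name : substr($name, 0, $dot);
    $ext  = ($dot === false || $dot === 0) ? '' : substr($name, $dot);

    $candidate = $name;
    while (file_exists($dir . DIRECTORY_SEPARATOR . $candidate)) {
        $candidate = $base . '-' . bin2hex(random_bytes(4)) . $ext;
    }

    return $candidate;
}
```

Note that checking and then creating the file is not atomic, so two concurrent uploads could still collide; opening the file with fopen() in 'x' mode is one way to detect that.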

6. Hidden Files

Just because it can be named doesn't mean it should...

Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot.

7. Other Considerations

If you have to strip some chars of the file name, the extension is usually more important than the base name of the file. Allowing a considerable maximum number of characters for the file extension (8-16), one should strip the characters from the base name. It's also important to note that in the unlikely event of a file having more than one long extension, such as _.graphmlz.tag.gz, only _ (rather than _.graphmlz.tag) should be considered the file base name.
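A minimal sketch of truncating the base name while keeping the extension, under the assumption of ASCII-only names (so byte length equals character length). The function name truncate_filename and its defaults (255 and 16) are illustrative, and only the last extension is preserved; the multi-extension case mentioned above is not handled.

```php
<?php
// Sketch: shorten a file name to $maxLength by cutting the base name,
// keeping the (last) extension if it is at most $maxExtLength long.
function truncate_filename(string $name, int $maxLength = 255, int $maxExtLength = 16): string
{
    if (strlen($name) <= $maxLength) {
        return $name; // nothing to do
    }

    $base = $name;
    $ext  = '';
    $dot  = strrpos($name, '.');

    // Treat the suffix as an extension only if it is reasonably short
    // and the dot is not the first character (hidden files).
    if ($dot !== false && $dot > 0 && strlen($name) - $dot - 1 <= $maxExtLength) {
        $base = substr($name, 0, $dot);
        $ext  = substr($name, $dot); // includes the dot
    }

    return substr($base, 0, $maxLength - strlen($ext)) . $ext;
}
```

For example, a 300-character base name followed by .txt comes back as 251 characters plus .txt, for a total of 255.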

8. Resources

Calibre handles file name mangling pretty decently:

Wikipedia page on file name mangling and linked chapter from Using Samba.


If for instance, you try to create a file that violates any of the rules 1/2/3, you'll get a very useful error:

Warning: touch(): Unable to create file ... because No error in ... on line ...
Author by

Xeoncross

Updated on March 13, 2020

Comments

  • Xeoncross
    Xeoncross about 4 years

    I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

    So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

    /**
     * Convert a string to the file/URL safe "slug" form
     *
     * @param string $string the string to clean
     * @param bool $is_filename TRUE will allow additional filename characters
     * @return string
     */
    function sanitize($string = '', $is_filename = FALSE)
    {
     // Replace all weird characters with dashes
     $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    
     // Only allow one dash separator at a time (and make string lowercase)
     return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
    }
    

    Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

    $is_filename allows some additional characters like temp vim files

    update: removed the star character since I could not think of a valid use

    • elias
      elias about 14 years
      You better remove everything except [\w.-]
    • Matt Gibson
      Matt Gibson over 13 years
      You may find the Normalizer and the comments on it useful.
  • Xeoncross
    Xeoncross about 14 years
    I agree, most of the methods listed here remove known dangerous characters - my method removes everything that isn't a known safe character. Since most systems slug encode post URL's I would suggest we continue to follow this proven method rather than using the documented UTF-8 unsafe urlencode().
  • Xeoncross
    Xeoncross about 14 years
    Very nice - I have never seen this done without a translation table (like wordpress uses). However, I don't think this function is enough as-is since it only translates special characters but does not remove dangerous characters. Maybe it can be added to one above...
  • Xeoncross
    Xeoncross about 14 years
    You are correct about my assumption of the "u" modifier - I thought that it was for the text. I also forgot about the \w modifier including the underscore. I would normally convert all accented characters to ASCII - but I want this to work for other languages as well. I was assuming that there would be some kind of UTF-8 safe way that any character of a language could be used in a URL slug or filename so that even Arabic titles would work. After all, linux supports UTF-8 filenames and browsers should encode HTML links as needed. Big thanks for your input here.
  • Alan Donnelly
    Alan Donnelly about 14 years
    Ha! That entity encoding hack is sweet! Though it's not at all clear at first glance how this method does what it does. There's a problem though. Won't "Frédéric & Éric" turn into "Frederic amp Eric"?
  • Alan Donnelly
    Alan Donnelly about 14 years
    On second thought, you're actually right, but it's not just an issue with the browser encoding the links correctly. The easiest way to achieve close to what you want is to map non-ASCII characters to their closest ASCII equivalent and then URL-encode your link in the HTML body. The hard way is to ensure consistent UTF-8 encoding (or UTF-16, I think for some Chinese dialects) from your data store, through your webserver, application layer (PHP), page content, web browser and not urlencode your urls (but still strip 'undesirable' chars). This will give you nice non-encoded links and URLs.
  • Xeoncross
    Xeoncross about 14 years
    Good advice. I'm going to try to create a pure UTF-8 environment. Then, taking several strings from non-ASCII languages, I'll remove dangerous chars (./;:etc...), create files and then HTML links to those files, and see if I can click them and whether all this works. If not then I'll probably have to drop back to (raw)?urlencode() to allow UTF-8. I'll post back results here.
  • Xeoncross
    Xeoncross about 14 years
    Yes, testing for mb_strlen() is always an important thing!
  • Xeoncross
    Xeoncross about 14 years
    I created a file called สังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่.txt and then created a UTF-8 HTML file with a link to it. Amazingly it worked - even on Windows! However, I then had PHP file_put_contents('สังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่.txt') and it failed, creating a bizarre filename from that string. Then I tried to create it with fopen() and got the same messed up filename. So apparently PHP (on Windows at least) is incapable of creating UTF-8 filenames. bugs.php.net/bug.php?id=46990&thanks=6
  • Xeoncross
    Xeoncross about 14 years
    The % character is not recommended for filenames and hex encoded characters do not look as nice in the URL. Browsers can support UTF-8 strings which are much nicer and easier for non-ascii languages.
  • Xeoncross
    Xeoncross almost 14 years
    I award this answer because it got me thinking the most and also included a useful link to a project I never heard of that is worth looking into. I'll post once I find the answer though.
  • CodeVirtuoso
    CodeVirtuoso over 13 years
    Definitely - also, taking filename control away from users will prevent a possibility of 2 uploads having the same name.
  • Francesco
    Francesco about 13 years
    you could do a urlencode and THEN a str_replace('%20','-',url) ?
  • Alix Axel
    Alix Axel over 12 years
    @AlanDonnelly: Indeed, I've updated the function in my original answer (check the link), the trim() should also be trim($string, '-').
  • Alix Axel
    Alix Axel over 12 years
    @Xeoncross: The last preg_replace() should remove all dangerous chars.
  • Xeoncross
    Xeoncross over 12 years
    @AlixAxel, you're just everywhere aren't you. I was just reading over the PHP AWS SDK and they had some of your code for UUID's. The awesome code of phunction is just hard to beat.
  • Alix Axel
    Alix Axel over 12 years
    @Xeoncross: Thanks for letting me know, wasn't even aware of that! =)
  • Xeoncross
    Xeoncross over 11 years
    I'm not so sure about this, for one .\x00..\x20 can be reduced to .\x00\x20.
  • Alix Axel
    Alix Axel over 11 years
    @Xeoncross: I think that .\x00..\x20 removes dots and every character between \x00 and \x20, whereas .\x00\x20 should only remove those 3 bytes.
  • Kevin Mark
    Kevin Mark over 11 years
    The WordPress code isn't portable as it makes use of apply_filters
  • Xeoncross
    Xeoncross over 11 years
    This assumes mostly Latin based input. Add more UTF-8 characters from other languages to see where you will have problems.
  • COil
    COil over 11 years
    @Xeoncross I agree, as Christian said one must save an Id or hash AND the original filename. But this function provides an alternative as you can specify a default string when the sanitize process fails. I have added an unit test for this case. Thanks for reporting the bug.
  • Xeoncross
    Xeoncross over 11 years
    That post is very short-sighted and assumes everything is English.
  • Yotam Omer
    Yotam Omer over 10 years
    Note that the wordpress version replaces /[\s-]+/ with - which is better than the first version (which replaces only /\s+/) that can cause multiple dashes in a row
  • Xeoncross
    Xeoncross over 9 years
    This looks bad. \\s+ means a backslash followed by one or more whitespace. What is that about? Also, this uses blacklisting rather than whitelisting ignoring things like CMD, null, or BEL.
  • Xeoncross
    Xeoncross over 9 years
    Still bad. Now strings like /blog/2014-02/just-in-time are not allowed. Please use the tested code above or use the phunction PHP framework code.
  • joan16v
    joan16v over 9 years
    That's right. This function is only for the "just-in-time" part. Could be useful for some people.
  • Xeoncross
    Xeoncross over 9 years
    You can change the regex preg_replace('~[^\-\pL\pN\s]+~u', '-', $string)
  • joan16v
    joan16v over 9 years
    Awesome! I also added: $string = trim($string, "-");
  • WackGet
    WackGet about 9 years
    I wanted to use this to convert a bunch of TV episode names into Windows-based filenames, keeping their extensions, square brackets, dashes and single quotes, and changing colons to dots or dashes where appropriate. So here's my version: pastebin.com/0CsEV0Ax
  • Manuel Arwed Schmidt
    Manuel Arwed Schmidt over 8 years
    This answer requires more explanation for it to be safely used. There is not much information about the exact syntax for charlist on the net.
  • Kristoffer Bohmann
    Kristoffer Bohmann over 8 years
    New link to OWASP PHP ESAPI: https://github.com/OWASP/PHP-ESAPI
  • Jasom Dotnet
    Jasom Dotnet over 8 years
    Still missing some Czech and Slovak characters: 'ľ' => 'l', 'Ľ' => 'L', 'č' => 'c', 'Č' => 'C', 'ť' => 't', 'Ť' => 'T', 'ň' => 'n', 'Ň' => 'N', 'ĺ' => 'l', 'Ĺ' => 'L', 'Ř' => 'R', 'ř' => 'r', 'ě' => 'e', 'Ě' => 'E', 'ů' => 'u', 'Ů' => 'U'
  • cbmtrx
    cbmtrx over 8 years
    And no doubt many more. I'm actually trying to figure out if there exists an ISO- set that includes combinations of characters. How does one "choose" one set if the content demands characters from all of them? UTF-8 I'm assuming...
  • Jasom Dotnet
    Jasom Dotnet over 8 years
    I found out how to transliterate any string using one line of PHP: $string = transliterator_transliterate('Any-Latin;Latin-ASCII;', $string); See my answer below or read linked blog post.
  • cbmtrx
    cbmtrx over 8 years
    IF you're using Drupal and IF you install an extension. Not really "one line of PHP".
  • Jasom Dotnet
    Jasom Dotnet over 8 years
    No, you have read it wrong: IF you can install PHP extensions on your server (or hosting) :-) Here's the post.
  • cbmtrx
    cbmtrx over 8 years
    Ah, got it. Thanks @JasomDotnet --I have my current solution working for now but it's a limited character set so the extension is worth checking out.
  • erikvimz
    erikvimz over 8 years
    Just for reference wordpress apply_filters can be found here and sanitize_file_name over here.
  • Maciek Semik
    Maciek Semik almost 8 years
    what about multiple spaces? Replace
  • David Goodwin
    David Goodwin over 7 years
    Thanks - that looks ideal for my purposes.
  • viljun
    viljun over 7 years
    The $anal -variable sounds very frightening to me with the force-option.
  • Jonathan
    Jonathan over 7 years
    At 425ms it's pretty slow, just FYI