string sanitizer for filename

178,167

Solution 1

Instead of worrying about overlooking characters, how about using a whitelist of the characters you are happy to allow? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more restrictive than most filesystems, but it should keep you safe.
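A minimal sketch of such a whitelist, assuming the last period is the one worth keeping as the extension separator. The function name whitelist_filename is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: keeps a-z, 0-9, "_" and at most one period (the last one).
function whitelist_filename(string $name): string
{
    $name = strtolower($name);
    // Drop every character that is not on the whitelist.
    $name = preg_replace('/[^a-z0-9_.]/', '', $name);
    // Find the last period; it separates the base name from the extension.
    $pos = strrpos($name, '.');
    if ($pos === false) {
        return $name;
    }
    // Remove all other periods from the base name.
    $base = str_replace('.', '', substr($name, 0, $pos));
    $ext  = (string) substr($name, $pos + 1);
    // Avoid results such as ".htaccess" or "name." by requiring both parts.
    if ($base === '' || $ext === '') {
        return $base . $ext;
    }
    return $base . '.' . $ext;
}
```

With this sketch, "My Photo..final.JPG" becomes "myphotofinal.jpg", and ".htaccess" becomes "htaccess". The result can still be empty (for example for ".."), so the caller has to handle that case.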

Solution 2

Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:

// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);

Solution 3

This is how you can sanitize filenames for a file system as asked

function filter_filename($name) {
    // remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
    $name = str_replace(array_merge(
        array_map('chr', range(0, 31)),
        array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
    ), '', $name);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($name, PATHINFO_EXTENSION);
    $name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
    return $name;
}

Everything else is allowed in a filesystem, so the question is perfectly answered...

... but it could be dangerous to allow for example single quotes ' in a filename if you use it later in an unsafe HTML context because this absolutely legal filename:

 ' onerror= 'alert(document.cookie).jpg

becomes an XSS hole:

<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie).jpg' />

Because of that, the popular CMS WordPress removes them, although its list covered all relevant characters only after several updates:

$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few lines later, whitespace is replaced as well ...
$filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );

Their list now includes most of the characters on the URI reserved characters and URL unsafe characters lists.

Of course you could simply encode all these characters on HTML output, but most developers, me included, follow the idiom "better safe than sorry" and delete them in advance.
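For comparison, encoding on output could look roughly like this. It is only a sketch: the helper name img_tag is hypothetical, and it assumes the attribute is delimited by single quotes as in the XSS example.

```php
<?php
// Hypothetical helper: renders an <img> tag with an HTML-encoded src attribute.
function img_tag(string $src): string
{
    // ENT_QUOTES encodes both double and single quotes, so the value
    // cannot terminate the surrounding attribute.
    $safe = htmlspecialchars($src, ENT_QUOTES, 'UTF-8');
    return "<img src='" . $safe . "' />";
}

echo img_tag(" ' onerror= 'alert(document.cookie).jpg");
// prints: <img src=' &#039; onerror= &#039;alert(document.cookie).jpg' />
```

The single quotes arrive in the browser as character references, so the onerror text stays inside the src value instead of becoming an attribute of its own.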

So in the end I would suggest using this:

function filter_filename($filename, $beautify=true) {
    // sanitize filename
    $filename = preg_replace(
        '~
        [<>:"/\\\|?*]|            # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
        [\x00-\x1F]|             # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        [\x7F\xA0\xAD]|          # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
        [#\[\]@!$&\'()+,;=]|     # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
        [{}^\~`]                 # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
        ~x',
        '-', $filename);
    // avoids ".", ".." or ".hiddenFiles"
    $filename = ltrim($filename, '.-');
    // optional beautification
    if ($beautify) $filename = beautify_filename($filename);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($filename, PATHINFO_EXTENSION);
    $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
    return $filename;
}

Everything else that does not cause problems with the file system should be part of an additional function:

function beautify_filename($filename) {
    // reduce consecutive characters
    $filename = preg_replace(array(
        // "file   name.zip" becomes "file-name.zip"
        '/ +/',
        // "file___name.zip" becomes "file-name.zip"
        '/_+/',
        // "file---name.zip" becomes "file-name.zip"
        '/-+/'
    ), '-', $filename);
    $filename = preg_replace(array(
        // "file--.--.-.--name.zip" becomes "file.name.zip"
        '/-*\.-*/',
        // "file...name..zip" becomes "file.name.zip"
        '/\.{2,}/'
    ), '.', $filename);
    // lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
    $filename = mb_strtolower($filename, mb_detect_encoding($filename));
    // ".file-name.-" becomes "file-name"
    $filename = trim($filename, '.-');
    return $filename;
}

At this point you need to generate a filename if the result is empty, and you can decide whether you want to encode UTF-8 characters. You do not need to, though, as UTF-8 is allowed in all file systems that are used in web hosting contexts.
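One way to generate such a fallback name is sketched here. The helper name ensure_filename and the "file-" prefix are assumptions made for illustration; any scheme that yields a unique, harmless name would do.

```php
<?php
// Hypothetical helper: returns $filename unchanged, or a random name if it is empty.
function ensure_filename(string $filename, string $prefix = 'file-'): string
{
    if ($filename !== '') {
        return $filename;
    }
    // 8 random bytes give 16 hexadecimal characters.
    return $prefix . bin2hex(random_bytes(8));
}
```

It would be called on the sanitized result, for example ensure_filename(filter_filename($name)).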

The only thing you have to do is to use urlencode() (as you hopefully do it with all your URLs) so the filename საბეჭდი_მანქანა.jpg becomes this URL as your <img src> or <a href>: http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg

Stack Overflow does that, so I can post this link as a user would:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg

So this is a completely legal filename and not a problem, contrary to what @SequenceDigitale.com mentioned in his answer.
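A small sketch of the URL encoding step. Note the swap: it uses rawurlencode() rather than urlencode(), because rawurlencode() encodes spaces as %20 instead of "+", which suits path segments better; for the filename in this answer both functions give the same result. The helper name image_url and the example.com base URL are assumptions.

```php
<?php
// Hypothetical helper: builds an image URL from a base URL and a stored filename.
function image_url(string $baseUrl, string $filename): string
{
    // Encode only the filename segment, never the whole URL.
    return rtrim($baseUrl, '/') . '/' . rawurlencode($filename);
}

$url = image_url('https://example.com/img/', 'საბეჭდი_მანქანა.jpg');
```

The filename on disk keeps its UTF-8 characters; only the URL written into the HTML source is percent-encoded.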

Solution 4

SOLUTION 1 - simple and effective

$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );

  • strtolower() guarantees that the filename is lowercase (since case does not matter inside the URL, but it does in the NTFS filename)
  • [^a-z0-9]+ ensures that the filename keeps only letters and numbers
  • Substituting invalid characters with '-' keeps the filename readable

Example:

URL:  http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename

SOLUTION 2 - for very long URLs

You want to cache the URL contents and just need to have unique filenames. I would use this function:

$file_name = md5( strtolower( $url ) );

This creates a filename with a fixed length of 32 hexadecimal characters. The MD5 hash is in most cases unique enough for this kind of usage.

Example:

URL:  https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
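If collisions are a concern, the same idea works with a longer hash. The sketch below swaps MD5 for SHA-256 via PHP's hash() function; the helper name cache_filename and the ".cache" extension are assumptions. Unlike the line above, it does not lowercase the URL, so URLs that differ only in case get different filenames.

```php
<?php
// Hypothetical helper: fixed-length cache filename derived from a URL.
function cache_filename(string $url, string $extension = 'cache'): string
{
    // SHA-256 yields 64 hexadecimal characters, whatever the URL length.
    return hash('sha256', $url) . '.' . $extension;
}
```

The result is always 64 hexadecimal characters plus the extension, which stays well below the 255-byte limit mentioned earlier.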

Solution 5

What about using rawurlencode() ? http://www.php.net/manual/en/function.rawurlencode.php

Here is a function that sanitizes even Chinese characters:

public static function normalizeString ($str = '')
{
    $str = strip_tags($str); 
    $str = preg_replace('/[\r\n\t ]+/', ' ', $str);
    $str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
    $str = strtolower($str);
    $str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
    $str = htmlentities($str, ENT_QUOTES, "utf-8");
    $str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
    $str = str_replace(' ', '-', $str);
    $str = rawurlencode($str);
    $str = str_replace('%', '-', $str);
    return $str;
}

Here is the explanation:

  1. Strip HTML tags
  2. Collapse line breaks, tabs and carriage returns into a single space
  3. Replace characters that are illegal in folder names and filenames
  4. Convert the string to lowercase
  5. Remove foreign accents such as Éàû by converting them into HTML entities, then stripping the entity code and keeping the letter
  6. Replace spaces with dashes
  7. Encode special characters that could pass the previous steps and conflict with filenames on the server, e.g. "中文百强网"
  8. Replace "%" with dashes to make sure the link to the file will not be rewritten by the browser when querying the file

OK, some filenames will not be very meaningful, but in most cases it will work.

ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"

Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"

That is better than a 404 error.

Hope that was helpful.

Carl.

user151841
Author by

user151841

Updated on July 08, 2022

Comments

  • user151841
    user151841 almost 2 years

    I'm looking for a php function that will sanitize a string and make it ready to use for a filename. Anyone know of a handy one?

    ( I could write one, but I'm worried that I'll overlook a character! )

    Edit: for saving files on a Windows NTFS filesystem.

  • Tor Valamo
    Tor Valamo over 14 years
    so a filename can't have a period or an underscore, or anything like that?
  • Dominic Rodger
    Dominic Rodger over 14 years
    That would allow through filenames like .., which may or may not be a problem.
  • Dominic Rodger
    Dominic Rodger over 14 years
    @Jonathan - what's with the italics?
  • Sampson
    Sampson over 14 years
    @Tor, yes, sorry. Updated. @Dominic, just drawing emphasis on the text.
  • Tor Valamo
    Tor Valamo over 14 years
    @Dom - just check for that separately, since it's a fixed value.
  • user151841
    user151841 over 14 years
    What is gism? I get " Warning: preg_replace() [function.preg-replace]: Unknown modifier 'g' "
  • Sampson
    Sampson over 14 years
    g - global, i - insensitive case, s - dotall, m - multiline. In this example, you could do without s and m.
  • Pekka
    Pekka over 14 years
    No good for languages with Umlauts. This would result in Qubec for Québec, Dsseldorf for Düsseldorf, and so on.
  • Dominic Rodger
    Dominic Rodger over 14 years
    True - but like I said: "For example".
  • Blair McMillan
    Blair McMillan over 14 years
    Which may be perfectly acceptable to the OP. Otherwise, use something like php.net/manual/en/class.normalizer.php
  • i.am.michiel
    i.am.michiel about 11 years
    That is actually not what was asked. The op asks for a function to sanitize string, not a alternative.
  • Dominic Rodger
    Dominic Rodger about 11 years
    @i.am.michiel, perhaps, but given the OP accepted it, I'll assume they found it helpful.
  • Travis Pessetto
    Travis Pessetto about 11 years
    Where is it said he would be replacing with NULL? Also, this does not handle all special characters.
  • AgelessEssence
    AgelessEssence almost 11 years
    this regex returns warning " Unknown modifier '|' ", check at codepad.org/jf6O0OOY
  • Sean Vieira
    Sean Vieira almost 11 years
    @iim.hlk - yep, it was missing the wrapping parenthesis. I've added those now. Thanks!
  • Ronald Hulshof
    Ronald Hulshof about 10 years
    For Umlauts you can always include the following snippet: $string = strtr( $string, "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöøùúûüýÿÑñ", "AAAAAACEEEEIIIIOOOOOOUUUUYaaaaaaceeeeiiiiooooooouuuuyyNn" );
  • JamesHalsall
    JamesHalsall about 10 years
    This doesn't handle file names like "image.jpeg", it produces "imagejpeg"
  • Sean Vieira
    Sean Vieira about 10 years
    @JamesHalsall - correct. I've updated it so it does :-) Thanks for making the answer better!
  • Hayley
    Hayley about 10 years
    Not an answer to the question, should be a comment.
  • Dominic Rodger
    Dominic Rodger about 10 years
    Thanks @asdasd, but as I said, the OP accepting it makes me think they found it helpful.
  • rineez
    rineez over 9 years
    @user151841 For preg_replace the global flag is implicit. So there is no need for g if preg_replace is being used. When we want to control the number of replacements preg_replace has a limit parameter for that. Read the preg_replace documentation for more.
  • 23W
    23W over 9 years
    double check for ']' in file name. may be '\(\]' must be '\(\)' ?
  • Sean Vieira
    Sean Vieira over 9 years
    @23W - wow that survived for a long time - thanks for helping make the answer better!
  • Paul Hutchinson
    Paul Hutchinson over 9 years
    I'm not sure you want to let the colon (:) through on Windows as you can change drives that way (ie "d:\junk.txt" will get converted to d:junk.txt)
  • falstro
    falstro over 9 years
    there's a flaw in there, you should split it into two and run the check for .. afterwards. For example .?. would end up being ... Although since you filter / I can't see how you'd exploit that further right now, but it shows why the check for .. is ineffective here. Better yet probably, don't replace, just reject if it doesn't qualify.
  • Tarulia
    Tarulia over 9 years
    Not quite sure why but it doesn't seem to replace colons. Here's an example online: clicky. I might as well have an error in there, little sleepy :P
  • Alex Reinking
    Alex Reinking about 9 years
    You might also want to check that the file doesn't begin with a .. Wouldn't want to overwrite / create hidden files, or things like .htaccess, .htpasswd, etc.
  • cemper93
    cemper93 almost 9 years
    This is insufficient! For example, the filename "./.name" will still break out of the current directory. (Removing .. does nothing here, but removing / will turn the ./. into .. and hence break out of the target directory.)
  • cdhowie
    cdhowie almost 9 years
    @cemper93 No, this answer will just turn the string into ..name which would not break out of anything. Removing all path separator characters should be sufficient to prevent any directory traversal. (The removal of .. is technically unnecessary.)
  • Martin Kovachev
    Martin Kovachev over 8 years
    Yup - there are other special characters which need handling too. str_replace won't be the best bid here anyway.
  • Łukasz Rysiak
    Łukasz Rysiak over 8 years
    since i've used your solution, i have to mention, that if you use this solution with utf-8, you should switch to mb_ereg_replace. Otherwise chars will be messed up.
  • Sven
    Sven over 8 years
    @RonaldHulshof: Your snippet does not account for multibyte characters. For that you'd have to create a transformation array with key = umlaut, value = regular char and pass it as second parameter to strtr(). Alternatively, use iconv('UTF-8','ASCII//TRANSLIT',$string);
  • Mr Pablo
    Mr Pablo over 8 years
    This answer is terrible. Why would you allow the characters -_~,;:[]() in a filename?!
  • Sean Vieira
    Sean Vieira over 8 years
    Because none of those values are illegal on the Windows file system and why lose more information than you have to? You could change the regular expression to simply [^a-z0-9_-] if you want to be really restrictive - or just use a generated name and throw away the given name and avoid all these problems. :-)
  • Basil Musa
    Basil Musa over 8 years
    You are not removing NULL and Control characters. ASCII of 0 to 32 should all be removed from the string.
  • JasonXA
    JasonXA about 8 years
    Note that : is illegal.
  • Sean Vieira
    Sean Vieira about 8 years
    Updated - thanks for helping make the answer better!
  • Slava
    Slava about 8 years
    Will not work with other alphabets, like Файл.docx
  • Slava
    Slava about 8 years
    I would add trim() to trim spaces before and after, so that copy-pasted ` filename.txt ` would sanitize to filename.txt
  • Slava
    Slava about 8 years
    Also, leaving whitespace characters like Tab, New line and Carriage return makes no sense in a file name. I suggest replacing \s with a literal space (hit spacebar). As a result: trim(mb_ereg_replace("([^\w \d\-_~,;\[\]\(\).])", '', $file)).
  • Slava
    Slava about 8 years
    @falstro file..name.txt is a perfectly valid file name. Why would one reject it?
  • falstro
    falstro about 8 years
    @Alph.Dev because the discussion was about the file called .. (which is typically a hard link to a parent directory), not arbitrary usage within a file name.
  • ChrisJJ
    ChrisJJ over 7 years
    This will fail to 'make it ready to use for a filename' if the results is too long.
  • matteo
    matteo over 7 years
    @BlairMcMillan how would Normalizer help? None of the types of Unicode normalizations seem to have anything to do with guaranteeing the fitness of a string as filename for a particular type filesystem.
  • mgutt
    mgutt about 7 years
    @Alph.Dev Its not "sense" related, its simply forbidden to use those whitespace characters in Windows: stackoverflow.com/a/42058764/318765 @falstro Your suggestion does not make sense as / is removed and ..filename does not target the parent directory. The only filename that could be a problem is .. or .hiddenFilen, but you can handle it with ltrim() as mentioned in my answer as well.
  • mgutt
    mgutt about 7 years
    UTF-8 is allowed in the file system and it is allowed in URLs, so why should it produce a 404 error? The only thing you need to do is to encode the URL http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg to http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg in the HTML source code as you hopefully do with all your URLs.
  • mgutt
    mgutt about 7 years
    Some other points: You remove HTML tags through strip_tags() and after that you remove [<>]. By that strip_tags() is not really needed at all. The same point are the quotes. There are no quotes left when you decode with ENT_QUOTES. And the str_replace() does not remove consecutive white spaces and then you use strtolower() for mult-byte string. And why do you convert to lowercase at all? And finally you did not catch any reserved character as @BasilMusa mentioned. More details in my answer: stackoverflow.com/a/42058764/318765
  • mgutt
    mgutt about 7 years
    @cdhowie Yes, but the filename ./. becomes ... And finally this answer misses all other file system reserved characters like NULL. More in my answer: stackoverflow.com/a/42058764/318765
  • mgutt
    mgutt about 7 years
    Why do you want to replace diacritics? Simply use urlencode() before you use the filename as a src or href. The only currently used file system that has problems with UTF-8 is FATx (used by XBOX): en.wikipedia.org/wiki/Comparison_of_file_systems#Limits And I do not think this is used by web servers
  • Slava
    Slava about 7 years
    @mgutt What is your point? Forbidden or useless, it makes no difference. I suggest to remove/replace them so that we can have a valid filename afterwards. We are sanitizing file names aren't we here?
  • mgutt
    mgutt about 7 years
    @Alph.Dev It is a difference for this answer. As it is forbidden the answer of SeanVieira is completely wrong because its unsafe to use. That was the point I liked to highlight as it is the most popular answer.
  • Admin
    Admin about 7 years
    Good job. The most helpful answer for me. +1
  • Admin
    Admin about 7 years
    Oh... The function works well, but since some time it started putting - between every character, like r-u-l-e-s and I have no idea why this happen. Sure is that it is not fault of the function, but just asking - what might be reason of such behavior? Wrong encoding?
  • Admin
    Admin about 7 years
    Oh well... Just made a debug and it happens just after the preg_replace in filter_filename().
  • Admin
    Admin about 7 years
    After removing these comments, it started working again.
  • mgutt
    mgutt about 7 years
    Which comments did you remove? Send me an email if this is easier: gutt.it/contact.htm
  • Admin
    Admin about 7 years
    those from first preg_replace.
  • mikeytown2
    mikeytown2 about 7 years
    Note that mb_strtolower can create ? and \.
  • mgutt
    mgutt about 7 years
    @mikextown2 Are you sure? Should not happen because of mb_detect_encoding
  • Patrick Janser
    Patrick Janser almost 7 years
    Great digging and complete answer! Thanks for the work!
  • Yash Kumar Verma
    Yash Kumar Verma over 6 years
    fell in love with it !
  • Aaron Esau
    Aaron Esau about 6 years
    Is there a regex string for this?
  • adilbo
    adilbo almost 6 years
    Maybe MD5 could by a Problem: Be careful when using hashes with URL’s. While the square root of the number skrenta.com/2007/08/md5_tutorial.html of URL’s is still a lot bigger then the current web size if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. Its probably a non issue in our case, but for billions of pages I would opt for a much larger hashing algorithm such as SHA 256 or avoid it altogether. Source: boyter.org/2013/01/code-for-a-search-engine-in-php-part-1
  • TheRealChx101
    TheRealChx101 over 5 years
    What about non-printable characters? It's better to use the white list approach than black list approach in this case. Basically allow only the printable ASCII file names excluding the special letters of course. But for non-english locales, that's another problem.
  • vatavale
    vatavale almost 5 years
    Special thanks for the comments technique inside regexp!
  • vatavale
    vatavale almost 5 years
    I added "u" modifier to the end of the regexp for work with Unicode filenames.
  • func0der
    func0der almost 5 years
    Good, but it would not remove slashes, which could be a problem: Directory traversing.
  • spackmat
    spackmat about 4 years
    Beware: The double backslash in the RegEx must be additionally escaped with a third one for the PHP string. preg_replace('~[<>:"/\\|?*]~x','-', $filename) will otherwise let Hello\World.txt pass! Change [<>:"/\\|?*] to [<>:"/\\\|?*] to fix that.
  • TekOps
    TekOps almost 4 years
    Can you write an example and post it?
  • Smith
    Smith over 3 years
    You need to add the file extension separated by a ".": $name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension;
  • rolinger
    rolinger over 3 years
    excellent write up. I thought PHP would have something built in for this and was surprised that it didn't. But this serves my needs way more than I ever would have been able to write.
  • MMMahdy-PAPION
    MMMahdy-PAPION over 3 years
    I think using mb_ereg_replace for keeping any language character is the most wise way, but like this: mb_regex_encoding("UTF-8"); then $fixedfilename=mb_ereg_replace('^[\s]+|[^\P{C}]|[\\\\\/\*\:\?\"\>\<\|]+|[\s\.]+$','',$filename); because we have to remove somethings else like removing useless dots and spaces from end. Also it is better avoid to accept characters like ` and ' and ; and % and & that can have meanings for URL or PHP or HTML. A possible one line fast fixer can be this: PHP Sandbox
  • thelr
    thelr almost 3 years
    On Windows, the list of illegal, common characters for file names is \ / : * ? " < > |. EVERY one of those is allowed by the FILTER_SANITIZE_URL rule.
  • Gianpaolo Scrigna
    Gianpaolo Scrigna over 2 years
    Solution 1 ❤️. That's all I needed in my simple download method.
  • Matoeil
    Matoeil over 2 years
    please give the code to it
  • dobs
    dobs almost 2 years
    As variant - FILTER_SANITIZE_EMAIL. Remove all characters except letters, digits and !#$%&'*+-=?^_`{|}~@.[].