string sanitizer for filename
Solution 1
Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
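A minimal sketch of that whitelist idea, since no code was given for it. The function name `whitelist_filename` is made up for illustration, and keeping the last period (rather than the first) is my assumption about what "a single instance of a period" should mean for names with an extension. Note that non-ASCII letters are simply dropped, and the result can be empty, so the caller still has to handle that case.

```php
<?php
// Hypothetical helper: keep only a-z, 0-9, "_" and at most one period.
function whitelist_filename(string $name): string
{
    // Lowercase first so that A-Z is folded into the allowed a-z range.
    $name = strtolower($name);
    // Drop every byte that is not on the whitelist.
    $name = preg_replace('/[^a-z0-9_.]/', '', $name);
    // Keep only the last period (the one in front of the extension).
    $lastDot = strrpos($name, '.');
    if ($lastDot !== false) {
        $base = str_replace('.', '', substr($name, 0, $lastDot));
        $ext  = substr($name, $lastDot + 1);
        $name = $base . '.' . $ext;
    }
    // No leading or trailing period, so ".", ".." and ".htaccess" cannot survive.
    return trim($name, '.');
}

echo whitelist_filename('My Résumé (final).v2.PDF'), "\n"; // prints "myrsumfinalv2.pdf"
```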
Solution 2
Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:
// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);
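Several comments suggest rejecting a bad name instead of repairing it, and checking for a leading period. Here is a rough sketch of that variant, not a drop-in replacement: `is_acceptable_filename` is a hypothetical name, it uses `preg_match` with the `u` modifier instead of `mb_ereg_replace` (so it assumes UTF-8 input), and it mirrors the character set of the snippet above, including `\s`, which you may want to narrow to a literal space.

```php
<?php
// Hypothetical validator: returns true only for names that need no cleaning.
function is_acceptable_filename(string $name): bool
{
    // Reject empty names and names starting with a period
    // (covers ".", ".." and hidden files such as ".htaccess").
    if ($name === '' || $name[0] === '.') {
        return false;
    }
    // Reject runs of periods anywhere in the name.
    if (strpos($name, '..') !== false) {
        return false;
    }
    // Same character set as the first replacement above (\d and _ are part of \w).
    // preg_match() returns false for invalid UTF-8, which is rejected as well.
    return preg_match('/^[\w\s\-~,;\[\]().]+$/u', $name) === 1;
}

var_dump(is_acceptable_filename('report_2024 (final).txt')); // bool(true)
var_dump(is_acceptable_filename('d:junk.txt'));              // bool(false)
```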
Solution 3
This is how you can sanitize filenames for a file system as asked
function filter_filename($name) {
// remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
$name = str_replace(array_merge(
array_map('chr', range(0, 31)),
array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
), '', $name);
// limit filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($name, PATHINFO_EXTENSION);
$name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
return $name;
}
Everything else is allowed in a filesystem, so the question is perfectly answered...
... but it could be dangerous to allow, for example, single quotes ' in a filename if you use it later in an unsafe HTML context, because this absolutely legal filename:
' onerror= 'alert(document.cookie).jpg
becomes an XSS hole:
<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />
Because of that, the popular CMS software WordPress removes them, but it took several updates until they covered all relevant chars:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few lines later whitespace is replaced as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )
Their list now includes most of the characters that are part of the URI reserved characters and URL unsafe characters lists.
Of course you could simply encode all these chars on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.
So finally I would suggest using this:
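For completeness, the encode-on-output alternative could look roughly like this. It is only a sketch using the hostile filename from the example above; in a real page the value would usually be URL-encoded first as well.

```php
<?php
// The "absolutely legal" but hostile filename from the example above.
$image = "' onerror= 'alert(document.cookie).jpg";

// Escape for an HTML attribute context; ENT_QUOTES also encodes single quotes.
$safe = htmlspecialchars($image, ENT_QUOTES, 'UTF-8');

// The quote can no longer close the src attribute.
echo "<img src='" . $safe . "' />\n";
```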
function filter_filename($filename, $beautify=true) {
// sanitize filename
$filename = preg_replace(
'~
[<>:"/\\\|?*]| # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
[\x00-\x1F]| # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
[\x7F\xA0\xAD]| # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
[#\[\]@!$&\'()+,;=]| # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
[{}^\~`] # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
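~xu', // "u" = UTF-8 mode (input must be valid UTF-8); without it \xA0 and \xAD match single bytes inside multi-byte characters and corrupt them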
'-', $filename);
// avoids ".", ".." or ".hiddenFiles"
$filename = ltrim($filename, '.-');
// optional beautification
if ($beautify) $filename = beautify_filename($filename);
// limit filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($filename, PATHINFO_EXTENSION);
$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
return $filename;
}
Everything else that does not cause problems with the file system should be part of an additional function:
function beautify_filename($filename) {
// reduce consecutive characters
$filename = preg_replace(array(
// "file name.zip" becomes "file-name.zip"
'/ +/',
// "file___name.zip" becomes "file-name.zip"
'/_+/',
// "file---name.zip" becomes "file-name.zip"
'/-+/'
), '-', $filename);
$filename = preg_replace(array(
// "file--.--.-.--name.zip" becomes "file.name.zip"
'/-*\.-*/',
// "file...name..zip" becomes "file.name.zip"
'/\.{2,}/'
), '.', $filename);
// lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
$filename = mb_strtolower($filename, mb_detect_encoding($filename));
// ".file-name.-" becomes "file-name"
$filename = trim($filename, '.-');
return $filename;
}
And at this point you need to generate a filename if the result is empty, and you can decide whether you want to encode UTF-8 characters. But you do not need to, as UTF-8 is allowed in all file systems that are used in web hosting contexts.
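Generating a name for the empty case could be sketched like this. `filename_or_fallback`, the `file-` prefix and the 8 random bytes are all arbitrary choices of mine, not part of the functions above.

```php
<?php
// Hypothetical helper: fall back to a random name when sanitizing left nothing.
function filename_or_fallback(string $sanitized, string $extension = ''): string
{
    if ($sanitized !== '') {
        return $sanitized;
    }
    // 8 random bytes become 16 hex characters; the "file-" prefix is arbitrary.
    $name = 'file-' . bin2hex(random_bytes(8));
    return $extension !== '' ? $name . '.' . $extension : $name;
}

echo filename_or_fallback('photo.jpg'), "\n"; // prints "photo.jpg"
echo filename_or_fallback('', 'jpg'), "\n";   // e.g. file-<16 hex chars>.jpg
```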
The only thing you have to do is to use urlencode() (as you hopefully do it with all your URLs) so the filename საბეჭდი_მანქანა.jpg becomes this URL as your <img src> or <a href>:
http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg
Stack Overflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg
So this is a completely legal filename and not a problem as @SequenceDigitale.com mentioned in his answer.
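A small sketch of the encoding step on output. I use rawurlencode() here instead of urlencode(), which is my own swap: urlencode() turns spaces into "+", which is form encoding rather than path encoding. The base URL is the one from the example above.

```php
<?php
// Filename as stored on disk (UTF-8).
$filename = 'საბეჭდი_მანქანა.jpg';

// Percent-encode only the filename part, never the whole URL.
$url = 'http://www.maxrev.de/html/img/' . rawurlencode($filename);

// Escape once more for the HTML attribute context.
echo '<img src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" />', "\n";
```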
Solution 4
SOLUTION 1 - simple and effective
$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );
- strtolower() guarantees the filename is lowercase (since case does not matter inside the URL, but in the NTFS filename)
- [^a-z0-9]+ will ensure the filename only keeps letters and numbers
- Substituting invalid characters with '-' keeps the filename readable
Example:
URL: http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
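If the input is an uploaded filename rather than a URL, the same replacement would also eat the period in front of the extension, as one of the comments points out. A possible sketch that keeps the extension; `slug_filename` and the `file` fallback are hypothetical names chosen for illustration:

```php
<?php
// Hypothetical helper: slug the base name, keep a cleaned-up extension.
function slug_filename(string $name): string
{
    $ext  = strtolower(pathinfo($name, PATHINFO_EXTENSION));
    $base = strtolower(pathinfo($name, PATHINFO_FILENAME));
    // Same replacement as above, then trim dashes left at either end.
    $base = trim(preg_replace('/[^a-z0-9]+/', '-', $base), '-');
    $ext  = preg_replace('/[^a-z0-9]+/', '', $ext);
    if ($base === '') {
        $base = 'file'; // arbitrary fallback so the result never starts with "."
    }
    return $ext !== '' ? $base . '.' . $ext : $base;
}

echo slug_filename('My Holiday Photo (1).JPG'), "\n"; // prints "my-holiday-photo-1.jpg"
```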
SOLUTION 2 - for very long URLs
You want to cache the URL contents and just need to have unique filenames. I would use this function:
$file_name = md5( strtolower( $url ) );
This will create a filename with fixed length. The MD5 hash is in most cases unique enough for this kind of usage.
Example:
URL: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
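If collisions are a concern (see the comment about very large URL sets), a longer hash can be swapped in. This is only a sketch; `cache_filename` is a made-up name, and leaving out strtolower() is my assumption, since URL paths can be case-sensitive and lowercasing would map such URLs to the same file.

```php
<?php
// Hypothetical helper: fixed-length cache filename derived from a URL.
function cache_filename(string $url, string $algo = 'sha256'): string
{
    // Hash the URL exactly as given; hash() returns lowercase hex digits.
    return hash($algo, $url);
}

echo strlen(cache_filename('https://example.com/a')), "\n";        // prints 64
echo strlen(cache_filename('https://example.com/a', 'md5')), "\n"; // prints 32
```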
Solution 5
What about using rawurlencode()? http://www.php.net/manual/en/function.rawurlencode.php
Here is a function that sanitizes even Chinese characters:
public static function normalizeString ($str = '')
{
$str = strip_tags($str);
$str = preg_replace('/[\r\n\t ]+/', ' ', $str);
$str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
$str = strtolower($str);
$str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
$str = htmlentities($str, ENT_QUOTES, "utf-8");
$str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
$str = str_replace(' ', '-', $str);
$str = rawurlencode($str);
$str = str_replace('%', '-', $str);
return $str;
}
Here is the explanation:
- Strip HTML tags
- Collapse line breaks, tabs and carriage returns into a single space
- Remove characters that are illegal in folder and file names
- Put the string in lower case
- Remove foreign accents such as Éàû by converting them into HTML entities, then removing the entity code and keeping the letter
- Replace spaces with dashes
- Encode special chars that could pass the previous steps and conflict with filenames on the server, e.g. "中文百强网"
- Replace "%" with dashes to make sure the link to the file will not be rewritten by the browser when querying the file
OK, some filenames will not be relevant, but in most cases it will work.
ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"
Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"
It's better like that than a 404 error.
Hope that was helpful.
Carl.
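The accent-removal step can also be done with an explicit map passed to strtr(), as one of the comments suggests. A minimal sketch; `strip_accents` is a hypothetical name and the map is a tiny illustrative subset, not a complete table:

```php
<?php
// Hypothetical helper: replace accented letters using an explicit map.
// With an array argument strtr() compares whole strings, so multi-byte
// UTF-8 keys work as expected.
function strip_accents(string $text): string
{
    $map = [
        'À' => 'A', 'É' => 'E', 'Ü' => 'U', 'Ö' => 'O',
        'à' => 'a', 'é' => 'e', 'û' => 'u', 'ü' => 'u', 'ö' => 'o',
        'ß' => 'ss',
    ];
    return strtr($text, $map);
}

echo strip_accents('Québec Düsseldorf'), "\n"; // prints "Quebec Dusseldorf"
```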
user151841
Updated on July 08, 2022
Comments
-
user151841 almost 2 years
I'm looking for a php function that will sanitize a string and make it ready to use for a filename. Anyone know of a handy one?
( I could write one, but I'm worried that I'll overlook a character! )
Edit: for saving files on a Windows NTFS filesystem.
-
Tor Valamo over 14 years
so a filename can't have a period or an underscore, or anything like that?
-
Dominic Rodger over 14 years
That would allow through filenames like .., which may or may not be a problem.
-
Dominic Rodger over 14 years
@Jonathan - what's with the italics?
-
Sampson over 14 years
@Tor, yes, sorry. Updated. @Dominic, just drawing emphasis on the text.
-
Tor Valamo over 14 years
@Dom - just check for that separately, since it's a fixed value.
-
user151841 over 14 years
What is gism? I get " Warning: preg_replace() [function.preg-replace]: Unknown modifier 'g' "
-
Sampson over 14 years
g - global, i - insensitive case, s - dotall, m - multiline. In this example, you could do without s and m.
-
Pekka over 14 years
No good for languages with Umlauts. This would result in Qubec for Québec, Dsseldorf for Düsseldorf, and so on.
-
Dominic Rodger over 14 years
True - but like I said: "For example".
-
Blair McMillan over 14 years
Which may be perfectly acceptable to the OP. Otherwise, use something like php.net/manual/en/class.normalizer.php
-
i.am.michiel about 11 years
That is actually not what was asked. The OP asks for a function to sanitize a string, not an alternative.
-
Dominic Rodger about 11 years
@i.am.michiel, perhaps, but given the OP accepted it, I'll assume they found it helpful.
-
Travis Pessetto about 11 years
Where is it said he would be replacing with NULL? Also, this does not handle all special characters.
-
AgelessEssence almost 11 years
this regex returns warning " Unknown modifier '|' ", check at codepad.org/jf6O0OOY
-
Sean Vieira almost 11 years
@iim.hlk - yep, it was missing the wrapping parenthesis. I've added those now. Thanks!
-
Ronald Hulshof about 10 years
For Umlauts you can always include the following snippet: $string = strtr( $string, "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöøùúûüýÿÑñ", "AAAAAACEEEEIIIIOOOOOOUUUUYaaaaaaceeeeiiiiooooooouuuuyyNn" );
-
JamesHalsall about 10 years
This doesn't handle file names like "image.jpeg", it produces "imagejpeg"
-
Sean Vieira about 10 years
@JamesHalsall - correct. I've updated it so it does :-) Thanks for making the answer better!
-
Hayley about 10 years
Not an answer to the question, should be a comment.
-
Dominic Rodger about 10 years
Thanks @asdasd, but as I said, the OP accepting it makes me think they found it helpful.
-
rineez over 9 years
@user151841 For preg_replace the global flag is implicit. So there is no need for g if preg_replace is being used. When we want to control the number of replacements preg_replace has a limit parameter for that. Read the preg_replace documentation for more.
-
23W over 9 years
double check for ']' in file name. may be '\(\]' must be '\(\)' ?
-
Sean Vieira over 9 years
@23W - wow that survived for a long time - thanks for helping make the answer better!
-
Paul Hutchinson over 9 years
I'm not sure you want to let the colon (:) through on Windows as you can change drives that way (ie "d:\junk.txt" will get converted to d:junk.txt)
-
falstro over 9 years
there's a flaw in there, you should split it into two and run the check for .. afterwards. For example .?. would end up being "..". Although since you filter / I can't see how you'd exploit that further right now, but it shows why the check for .. is ineffective here. Better yet probably, don't replace, just reject if it doesn't qualify.
-
Tarulia over 9 years
Not quite sure why but it doesn't seem to replace colons. Here's an example online: clicky. I might as well have an error in there, little sleepy :P
-
Alex Reinking about 9 years
You might also want to check that the file doesn't begin with a period (.). Wouldn't want to overwrite / create hidden files, or things like .htaccess, .htpasswd, etc.
-
cemper93 almost 9 years
This is insufficient! For example, the filename "./.name" will still break out of the current directory. (Removing .. does nothing here, but removing / will turn the ./. into .. and hence break out of the target directory.)
-
cdhowie almost 9 years
@cemper93 No, this answer will just turn the string into ..name which would not break out of anything. Removing all path separator characters should be sufficient to prevent any directory traversal. (The removal of .. is technically unnecessary.)
-
Martin Kovachev over 8 years
Yup - there are other special characters which need handling too. str_replace won't be the best bid here anyway.
-
Łukasz Rysiak over 8 years
since i've used your solution, i have to mention, that if you use this solution with utf-8, you should switch to mb_ereg_replace. Otherwise chars will be messed up.
-
Sven over 8 years
@RonaldHulshof: Your snippet does not account for multibyte characters. For that you'd have to create a transformation array with key = umlaut, value = regular char and pass it as second parameter to strtr(). Alternatively, use iconv('UTF-8','ASCII//TRANSLIT',$string);
-
Mr Pablo over 8 years
This answer is terrible. Why would you allow the characters -_~,;:[]() in a filename?!
-
Sean Vieira over 8 years
Because none of those values are illegal on the Windows file system and why lose more information than you have to? You could change the regular expression to simply [^a-z0-9_-] if you want to be really restrictive - or just use a generated name and throw away the given name and avoid all these problems. :-)
-
Basil Musa over 8 years
You are not removing NULL and Control characters. ASCII of 0 to 32 should all be removed from the string.
-
JasonXA about 8 years
Note that : is illegal.
-
Sean Vieira about 8 years
Updated - thanks for helping make the answer better!
-
Slava about 8 years
Will not work with other alphabets, like Файл.docx
-
Slava about 8 years
I would add trim() to trim spaces before and after, so that copy-pasted ` filename.txt ` would sanitize to filename.txt
-
Slava about 8 years
Also, leaving whitespace characters like Tab, New line and Carriage return makes no sense in a file name. I suggest replacing \s with a literal space (hit spacebar). As a result: trim(mb_ereg_replace("([^\w \d\-_~,;\[\]\(\).])", '', $file)).
-
Slava about 8 years
@falstro file..name.txt is a perfectly valid file name. Why would one reject it?
-
falstro about 8 years
@Alph.Dev because the discussion was about the file called .. (which is typically a hard link to a parent directory), not arbitrary usage within a file name.
-
ChrisJJ over 7 years
This will fail to 'make it ready to use for a filename' if the result is too long.
-
matteo over 7 years
@BlairMcMillan how would Normalizer help? None of the types of Unicode normalizations seem to have anything to do with guaranteeing the fitness of a string as filename for a particular type filesystem.
-
mgutt about 7 years
@Alph.Dev It's not "sense" related, it's simply forbidden to use those whitespace characters in Windows: stackoverflow.com/a/42058764/318765 @falstro Your suggestion does not make sense as / is removed and ..filename does not target the parent directory. The only filename that could be a problem is .. or .hiddenFilen, but you can handle it with ltrim() as mentioned in my answer as well.
-
mgutt about 7 years
UTF-8 is allowed in the file system and it is allowed in URLs, so why should it produce a 404 error? The only thing you need to do is to encode the URL http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg to http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg in the HTML source code as you hopefully do with all your URLs.
-
mgutt about 7 years
Some other points: You remove HTML tags through strip_tags() and after that you remove [<>]. By that strip_tags() is not really needed at all. The same point are the quotes. There are no quotes left when you decode with ENT_QUOTES. And the str_replace() does not remove consecutive white spaces and then you use strtolower() for multi-byte string. And why do you convert to lowercase at all? And finally you did not catch any reserved character as @BasilMusa mentioned. More details in my answer: stackoverflow.com/a/42058764/318765
-
mgutt about 7 years
@cdhowie Yes, but the filename ./. becomes "..". And finally this answer misses all other file system reserved characters like NULL. More in my answer: stackoverflow.com/a/42058764/318765
-
mgutt about 7 years
Why do you want to replace diacritics? Simply use urlencode() before you use the filename as a src or href. The only currently used file system that has problems with UTF-8 is FATx (used by XBOX): en.wikipedia.org/wiki/Comparison_of_file_systems#Limits And I do not think this is used by web servers
-
Slava about 7 years
@mgutt What is your point? Forbidden or useless, it makes no difference. I suggest to remove/replace them so that we can have a valid filename afterwards. We are sanitizing file names aren't we here?
-
mgutt about 7 years
@Alph.Dev It is a difference for this answer. As it is forbidden the answer of SeanVieira is completely wrong because it's unsafe to use. That was the point I liked to highlight as it is the most popular answer.
-
Admin about 7 years
Good job. The most helpful answer for me. +1
-
Admin about 7 years
Oh... The function works well, but since some time it started putting - between every character, like r-u-l-e-s and I have no idea why this happen. Sure is that it is not fault of the function, but just asking - what might be reason of such behavior? Wrong encoding?
-
Admin about 7 years
Oh well... Just made a debug and it happens just after the preg_replace in filter_filename().
-
Admin about 7 years
After removing these comments, it started working again.
-
mgutt about 7 years
Which comments did you remove? Send me an email if this is easier: gutt.it/contact.htm
-
Admin about 7 years
those from first preg_replace.
-
mikeytown2 about 7 years
Note that mb_strtolower can create ? and \.
-
mgutt about 7 years
@mikextown2 Are you sure? Should not happen because of mb_detect_encoding
-
Patrick Janser almost 7 years
Great digging and complete answer! Thanks for the work!
-
Yash Kumar Verma over 6 years
fell in love with it !
-
Aaron Esau about 6 years
Is there a regex string for this?
-
adilbo almost 6 years
Maybe MD5 could be a problem: Be careful when using hashes with URL’s. While the square root of the number skrenta.com/2007/08/md5_tutorial.html of URL’s is still a lot bigger then the current web size if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. Its probably a non issue in our case, but for billions of pages I would opt for a much larger hashing algorithm such as SHA 256 or avoid it altogether. Source: boyter.org/2013/01/code-for-a-search-engine-in-php-part-1
-
TheRealChx101 over 5 years
What about non-printable characters? It's better to use the white list approach than black list approach in this case. Basically allow only the printable ASCII file names excluding the special letters of course. But for non-english locales, that's another problem.
-
vatavale almost 5 years
Special thanks for the comments technique inside regexp!
-
vatavale almost 5 years
I added "u" modifier to the end of the regexp for work with Unicode filenames.
-
func0der almost 5 years
Good, but it would not remove slashes, which could be a problem: Directory traversing.
-
spackmat about 4 years
Beware: The double backslash in the RegEx must be additionally escaped with a third one for the PHP string. preg_replace('~[<>:"/\\|?*]~x','-', $filename) will otherwise let Hello\World.txt pass! Change [<>:"/\\|?*] to [<>:"/\\\|?*] to fix that.
-
TekOps almost 4 years
Can you write an example and post it?
-
Smith over 3 years
You need to add the file extension separated by a ".": $name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension;
-
rolinger over 3 years
excellent write up. I thought PHP would have something built in for this and was surprised that it didn't. But this serves my needs way more than I ever would have been able to write.
-
MMMahdy-PAPION over 3 years
I think using mb_ereg_replace for keeping any language character is the most wise way, but like this: mb_regex_encoding("UTF-8"); then $fixedfilename=mb_ereg_replace('^[\s]+|[^\P{C}]|[\\\\\/\*\:\?\"\>\<\|]+|[\s\.]+$','',$filename); because we have to remove somethings else like removing useless dots and spaces from end. Also it is better avoid to accept characters like ` and ' and ; and % and & that can have meanings for URL or PHP or HTML. A possible one line fast fixer can be this: PHP Sandbox
-
thelr almost 3 years
On Windows, the list of illegal, common characters for file names is \ / : * ? " < > |. EVERY one of those is allowed by the FILTER_SANITIZE_URL rule.
-
Gianpaolo Scrigna over 2 years
Solution 1 ❤️. That's all I needed in my simple download method.
-
Matoeil over 2 years
please give the code to it
-
dobs almost 2 years
As variant - FILTER_SANITIZE_EMAIL. Remove all characters except letters, digits and !#$%&'*+-=?^_`{|}~@.[].