Sanitizing strings to make them URL and filename safe?
Solution 1
Some observations on your solution:
- The 'u' modifier at the end of your pattern means that both the pattern and the text it's matching are interpreted as UTF-8, not just the text (I presume you assumed the latter?).
- \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
- The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
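A quick way to confirm the underscore point is to test single characters against \w. This is only an illustrative check, not part of the original function:

```php
<?php
// \w matches the underscore as well as ASCII letters and digits.
var_dump(preg_match('/^\w$/', '_'));           // int(1)
var_dump(preg_match('/^\w$/', '-'));           // int(0)

// A class that must exclude the underscore has to spell out what it allows.
var_dump(preg_match('/^[A-Za-z0-9]$/', '_'));  // int(0)
```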
Creating the slug
You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
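The steps described above could be sketched roughly as follows. The function name make_slug and the very short character map are hypothetical and only for illustration; a real map would need many more entries. As an extra assumption, this sketch also keeps digits, which the description above does not mention.

```php
<?php
// Hypothetical helper; the name and the (deliberately short) map are illustrative only.
function make_slug(string $title): string
{
    // Map a few accented characters to ASCII. A real table would be far larger.
    $map = array(
        'À' => 'A', 'Á' => 'A', 'É' => 'E', 'È' => 'E', 'Ñ' => 'N', 'Ü' => 'U',
        'à' => 'a', 'á' => 'a', 'é' => 'e', 'è' => 'e', 'ñ' => 'n', 'ü' => 'u',
    );
    $slug = strtr($title, $map);

    // Lowercase the ASCII letters; unmapped non-ASCII bytes are dropped by the next step.
    $slug = strtolower($slug);

    // Replace every run of characters outside [a-z0-9] with a single dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    // Remove dashes left over at either end.
    return trim($slug, '-');
}

echo make_slug('Frédéric & Éric'), "\n"; // frederic-eric
```

Because the replacement works on runs of characters, no separate pass is needed to collapse repeated dashes.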
Sanitization in general
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
https://github.com/OWASP/PHP-ESAPI https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
Solution 2
I found this larger function in the Chyrp code:
/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "—", "–", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean;

    // Prefer the multibyte-aware lowercase function when it is available.
    return ($force_lowercase) ?
        ((function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean)) :
        $clean;
}
and this one in the WordPress code:
/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
Update Sept 2012
Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.
Solution 3
This should make your filenames safe...
$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);
and a deeper solution to this is:
// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);
This assumes that you want a dot in the filename. If you want the result converted to lowercase, just use
$clean_name = strtolower($clean_name);
as the last line.
Solution 4
Try this:
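function normal_chars($string)
{
    // Turn accented characters into named entities (e.g. é becomes &eacute;).
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter(s) of each accent entity.
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // Replace everything that is not alphanumeric with a space, then collapse runs.
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

    return trim($string, ' -');
}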
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?
Solution 5
This isn't exactly an answer as it doesn't provide any solutions (yet!), but it's too big to fit in a comment...
I did some testing (regarding file names) on Windows 7 and Ubuntu 12.04 and what I found out was that:
1. PHP Can't Handle non-ASCII Filenames
Although both Windows and Ubuntu can handle Unicode filenames (even RTL ones as it seems) PHP 5.3 requires hacks to deal even with the plain old ISO-8859-1, so it's better to keep it ASCII only for safety.
2. The Length of the Filename Matters (Especially on Windows)
On Ubuntu, the maximum length a filename can have (including extension) is 255 (excluding path):
/var/www/uploads/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345/
However, on Windows 7 (NTFS) the maximum length a filename can have depends on its absolute path:
(0 + 0 + 244 + 11 chars) C:\1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234\1234567.txt
(0 + 3 + 240 + 11 chars) C:\123\123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\1234567.txt
(3 + 3 + 236 + 11 chars) C:\123\456\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456\1234567.txt
Wikipedia says that:
NTFS allows each path component (directory or filename) to be 255 characters long.
To the best of my knowledge (and testing), this is wrong.
In total (counting slashes) all these examples have 259 chars; if you strip the C:\ that gives 256 characters (not 255?!). The directories were created using the Explorer and you'll notice that it restrains itself from using all the available space for the directory name. The reason for this is to allow the creation of files using the 8.3 file naming convention. The same thing happens for other partitions.
Files don't need to reserve the 8.3 length requirements of course:
(255 chars) E:\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901.txt
You can't create any more sub-directories if the absolute path of the parent directory has more than 242 characters, because 256 = 242 + 1 + \ + 8 + . + 3. Using Windows Explorer, you can't create another directory if the parent directory has more than 233 characters (depending on the system locale), because 256 = 233 + 10 + \ + 8 + . + 3; the 10 here is the length of the string New folder.
The Windows file system poses a nasty problem if you want to ensure interoperability between file systems.
3. Beware of Reserved Characters and Keywords
Aside from removing non-ASCII, non-printable and control characters, you also need to re(place/move):
"*/:<>?\|
Just removing these characters might not be the best idea because the filename might lose some of its meaning. I think that, at the very least, multiple occurrences of these characters should be replaced by a single underscore (_), or perhaps something more representative (this is just an idea):
- "*? -> _
- /\| -> -
- : -> [ ]-[ ]
- < -> (
- > -> )
There are also special keywords that should be avoided (like NUL), although I'm not sure how to overcome that. Perhaps a black list with a random name fallback would be a good approach to solve it.
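One possible way to handle both the reserved characters and the reserved keywords is sketched below. The helper name replace_reserved is hypothetical, and prefixing an underscore to a reserved device name is just one assumed strategy (a random fallback name would work too). The device-name list (CON, PRN, AUX, NUL, COM1-9, LPT1-9) is the set Windows reserves.

```php
<?php
// Hypothetical helper, not taken from any library.
function replace_reserved(string $name): string
{
    // Collapse each run of the reserved characters "*/:<>?\| into one underscore.
    $name = preg_replace('~["*/:<>?\\\\|]+~', '_', $name);

    // Device names Windows reserves, with or without an extension (e.g. NUL, NUL.txt).
    $reserved = '~^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?$~i';
    if (preg_match($reserved, $name)) {
        // Assumption: prefixing an underscore is an acceptable way to defuse the name.
        $name = '_' . $name;
    }

    return $name;
}

echo replace_reserved('what?:now*.txt'), "\n"; // what_now_.txt
echo replace_reserved('NUL.txt'), "\n";        // _NUL.txt
```

This sketch does not strip control or non-ASCII characters; that step would still have to run separately.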
4. Case Sensitiveness
This should go without saying, but if you want to ensure file uniqueness across different operating systems you should transform file names to a normalized case, that way my_file.txt and My_File.txt on Linux won't both become the same my_file.txt file on Windows.
5. Make Sure It's Unique
If the file name already exists, a unique identifier should be appended to its base file name.
Common unique identifiers include the UNIX timestamp, a digest of the file contents or a random string.
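As a rough sketch of the random-string variant: the helper name unique_filename is hypothetical, and it assumes PHP 7 or later for random_bytes(). Note that checking and then creating the file is not atomic, so a concurrent upload could still claim the same name in between.

```php
<?php
// Hypothetical helper: returns a name that does not yet exist in $dir.
function unique_filename(string $dir, string $name): string
{
    if (!file_exists($dir . DIRECTORY_SEPARATOR . $name)) {
        return $name;
    }

    $info = pathinfo($name);
    $ext  = isset($info['extension']) ? '.' . $info['extension'] : '';

    do {
        // Append a short random hex string to the base name, before the extension.
        $candidate = $info['filename'] . '-' . bin2hex(random_bytes(4)) . $ext;
    } while (file_exists($dir . DIRECTORY_SEPARATOR . $candidate));

    return $candidate;
}
```

Swapping the random string for time() or a digest such as sha1_file() of the uploaded file would give the other two identifiers mentioned above.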
6. Hidden Files
Just because it can be named doesn't mean it should...
Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot.
7. Other Considerations
If you have to strip some chars of the file name, the extension is usually more important than the base name of the file. Allowing a considerable maximum number of characters for the file extension (8-16), one should strip the characters from the base name. It's also important to note that in the unlikely event of having more than one long extension, such as _.graphmlz.tag.gz, only _ (and not _.graphmlz.tag) should be considered as the file base name.
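A minimal sketch of shortening the base name while keeping the extension might look like this. The helper name truncate_filename, the 255-byte default and the 16-byte extension cap are assumptions drawn from the figures in this answer. It counts bytes (so it assumes ASCII names, as recommended in point 1) and only treats the part after the last dot as the extension, so multi-part extensions are not handled.

```php
<?php
// Hypothetical helper; the limits are assumptions based on the figures discussed above.
function truncate_filename(string $name, int $max = 255, int $max_ext = 16): string
{
    if (strlen($name) <= $max) {
        return $name;
    }

    // Treat everything from the last dot on as the extension, capped at $max_ext bytes
    // (plus the dot). A leading dot (hidden file) is not treated as an extension.
    $dot  = strrpos($name, '.');
    $ext  = ($dot === false || $dot === 0) ? '' : substr($name, $dot, $max_ext + 1);
    $base = ($dot === false || $dot === 0) ? $name : substr($name, 0, $dot);

    // Shorten the base name only, so the extension survives.
    return substr($base, 0, $max - strlen($ext)) . $ext;
}

echo strlen(truncate_filename(str_repeat('a', 300) . '.txt')), "\n"; // 255
```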
8. Resources
Calibre handles file name mangling pretty decently:
Wikipedia page on file name mangling and linked chapter from Using Samba.
If for instance, you try to create a file that violates any of the rules 1/2/3, you'll get a very useful error:
Warning: touch(): Unable to create file ... because No error in ... on line ...
Xeoncross
Updated on March 13, 2020
Comments
-
Xeoncross about 4 years
I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.
So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.
/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
    // Replace all weird characters with dashes
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

    // Only allow one dash separator at a time (and make string lowercase)
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}
Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?
$is_filename allows some additional characters like temp vim files
update: removed the star character since I could not think of a valid use
-
elias about 14 years
You better remove everything except [\w.-]
-
Matt Gibson over 13 years
You may find the Normalizer and the comments on it useful.
-
Xeoncross about 14 years
I agree, most of the methods listed here remove known dangerous characters - my method removes everything that isn't a known safe character. Since most systems slug encode post URL's I would suggest we continue to follow this proven method rather than using the documented UTF-8 unsafe urlencode().
-
Xeoncross about 14 years
Very nice - I have never seen this done without a translation table (like WordPress uses). However, I don't think this function is enough as-is since it only translates special characters but does not remove dangerous characters. Maybe it can be added to one above...
-
Xeoncross about 14 years
You are correct about my assumption of the "u" modifier - I thought that it was for the text. I also forgot about the \w modifier including the underscore. I would normally convert all accented characters to ASCII - but I want this to work for other languages as well. I was assuming that there would be some kind of UTF-8 safe way that any character of a language could be used in a URL slug or filename so that even Arabic titles would work. After all, Linux supports UTF-8 filenames and browsers should encode HTML links as needed. Big thanks for your input here.
-
Alan Donnelly about 14 years
Ha! That entity encoding hack is sweet! Though it's not at all clear at first glance how this method does what it does. There's a problem though. Won't "Frédéric & Éric" turn into "Frederic amp Eric"?
-
Alan Donnelly about 14 years
On second thought, you're actually right, but it's not just an issue with the browser encoding the links correctly. The easiest way to achieve close to what you want is to map non-ASCII characters to their closest ASCII equivalent and then URL-encode your link in the HTML body. The hard way is to ensure consistent UTF-8 encoding (or UTF-16, I think for some Chinese dialects) from your data store, through your webserver, application layer (PHP), page content, web browser and not urlencode your URLs (but still strip 'undesirable' chars). This will give you nice non-encoded links and URLs.
-
Xeoncross about 14 years
Good advice. I'm going to try to create a pure UTF-8 environment. Then, taking several strings from non-ASCII languages, I'll remove dangerous chars (./;:etc...), create files and then HTML links to those files to see if I can click them and see if all this works. If not then I'll probably have to drop back to (raw)?urlencode() to allow UTF-8. I'll post back results here.
-
Xeoncross about 14 years
Yes, testing for mb_strlen() is always an important thing!
-
Xeoncross about 14 years
I created a file called สังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่.txt and then created a UTF-8 HTML file with a link to it. Amazingly it worked - even on Windows! However, I then had PHP file_put_contents('สังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่.txt') and it failed, creating a bizarre filename from that string. Then I tried to create it with fopen() and got the same messed up filename. So apparently PHP (on Windows at least) is incapable of creating UTF-8 filenames. bugs.php.net/bug.php?id=46990&thanks=6
-
Xeoncross about 14 years
The % character is not recommended for filenames and hex encoded characters do not look as nice in the URL. Browsers can support UTF-8 strings which are much nicer and easier for non-ASCII languages.
-
Xeoncross almost 14 years
I award this answer because it got me thinking the most and also included a useful link to a project I never heard of that is worth looking into. I'll post once I find the answer though.
-
CodeVirtuoso over 13 years
Definitely - also, taking filename control away from users will prevent a possibility of 2 uploads having the same name.
-
Francesco about 13 years
you could do a urlencode and THEN a str_replace('%20','-',url) ?
-
Alix Axel over 12 years
@AlanDonnelly: Indeed, I've updated the function in my original answer (check the link), the trim() should also be trim($string, '-').
-
Alix Axel over 12 years
@Xeoncross: The last preg_replace() should remove all dangerous chars.
-
Xeoncross over 12 years
@AlixAxel, you're just everywhere aren't you. I was just reading over the PHP AWS SDK and they had some of your code for UUID's. The awesome code of phunction is just hard to beat.
-
Alix Axel over 12 years
@Xeoncross: Thanks for letting me know, wasn't even aware of that! =)
-
Xeoncross over 11 years
I'm not so sure about this, for one .\x00..\x20 can be reduced to .\x00\x20.
-
Alix Axel over 11 years
@Xeoncross: I think that .\x00..\x20 removes dots and every character between \x00 and \x20, whereas .\x00\x20 should only remove those 3 bytes.
-
Kevin Mark over 11 years
The WordPress code isn't portable as it makes use of apply_filters
-
Xeoncross over 11 years
This assumes mostly Latin based input. Add more UTF-8 characters from other languages to see where you will have problems.
-
COil over 11 years
@Xeoncross I agree, as Christian said one must save an Id or hash AND the original filename. But this function provides an alternative as you can specify a default string when the sanitize process fails. I have added a unit test for this case. Thanks for reporting the bug.
-
Xeoncross over 11 years
That post is very short-sighted and assumes everything is English.
-
Yotam Omer over 10 years
Note that the WordPress version replaces /[\s-]+/ with - which is better than the first version (which replaces only /\s+/) that can cause multiple dashes in a row
-
Xeoncross over 9 years
This looks bad. \\s+ means a backslash followed by one or more whitespace. What is that about? Also, this uses blacklisting rather than whitelisting ignoring things like CMD, null, or BEL.
-
Xeoncross over 9 years
Still bad. Now strings like /blog/2014-02/just-in-time are not allowed. Please use the tested code above or use the phunction PHP framework code.
-
joan16v over 9 years
That's right. This function is only for the "just-in-time" part. Could be useful for some people.
-
Xeoncross over 9 years
You can change the regex preg_replace('~[^\-\pL\pN\s]+~u', '-', $string)
-
joan16v over 9 years
Awesome! I added also: $string = trim($string, "-");
-
WackGet about 9 years
I wanted to use this to convert a bunch of TV episode names into Windows-based filenames, keeping their extensions, square brackets, dashes and single quotes, and changing colons to dots or dashes where appropriate. So here's my version: pastebin.com/0CsEV0Ax
-
Manuel Arwed Schmidt over 8 years
This answer requires more explanation for it to be safely used. Not much information about the exact syntax for charlist on the net.
-
Kristoffer Bohmann over 8 years
New link to OWASP PHP ESAPI: https://github.com/OWASP/PHP-ESAPI
-
Jasom Dotnet over 8 years
Still missing some Czech and Slovak characters:
'ľ' => 'l', 'Ľ' => 'L', 'č' => 'c', 'Č' => 'C', 'ť' => 't', 'Ť' => 'T', 'ň' => 'n', 'Ň' => 'N', 'ĺ' => 'l', 'Ĺ' => 'L', 'Ř' => 'R', 'ř' => 'r', 'ě' => 'e', 'Ě' => 'E', 'ů' => 'u', 'Ů' => 'U'
-
cbmtrx over 8 years
And no doubt many more. I'm actually trying to figure out if there exists an ISO- set that includes combinations of characters. How does one "choose" one set if the content demands characters from all of them? UTF-8 I'm assuming...
-
Jasom Dotnet over 8 years
I found out how to transliterate any string using one line of PHP:
$string = transliterator_transliterate('Any-Latin;Latin-ASCII;', $string);
See my answer below or read linked blog post.
-
cbmtrx over 8 years
IF you're using Drupal and IF you install an extension. Not really "one line of PHP".
-
Jasom Dotnet over 8 years
No, you have read it wrong: IF you can install PHP extensions on your server (or hosting) :-) Here's the post.
-
cbmtrx over 8 years
Ah, got it. Thanks @JasomDotnet -- I have my current solution working for now but it's a limited character set so the extension is worth checking out.
-
erikvimz over 8 years
-
Maciek Semik almost 8 years
what about multiple spaces? Replace
-
David Goodwin over 7 years
Thanks - that looks ideal for my purposes.
-
viljun over 7 years
The $anal -variable sounds very frightening to me with the force-option.
-
Jonathan over 7 years
At 425ms it's pretty slow, just FYI