Is there a cross-platform Java method to remove filename special chars?
Solution 1
As suggested elsewhere, this is not usually what you want to do. It is usually best to create a temporary file using a secure method such as File.createTempFile().
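As a rough sketch of the temp-file route: let the JDK pick a safe name and keep the untrusted name only as data. The class name, method name, prefix and suffix below are made up for illustration; File.createTempFile() itself is the standard API.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original answer.
final class TempFileSketch {
    // Creates an empty file with a JDK-generated name in the default temp directory.
    // The prefix passed to createTempFile must be at least three characters long.
    static File newScratchFile() {
        try {
            File f = File.createTempFile("download-", ".tmp");
            f.deleteOnExit(); // best-effort cleanup when the JVM exits
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(newScratchFile().getAbsolutePath());
    }
}
```

The name the user or web API supplied never touches the file system this way, so there is nothing to sanitize.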
You should not do this with a whitelist that keeps only 'good' characters. If the file name is made up of only Chinese characters, you will strip everything out of it. For this reason we can't use a whitelist; we have to use a blacklist.
Linux allows almost anything in a file name (only '/' and the NUL character are rejected), which can be a real pain. I would just limit Linux to the same list that you limit Windows to, so you save yourself headaches in the future.
Using this C# snippet on Windows I produced a list of characters that are not valid on Windows. There are quite a few more characters in this list than you might think (41), so I wouldn't recommend trying to create your own list.
foreach (char c in new string(Path.GetInvalidFileNameChars()))
{
    Console.Write((int)c);
    Console.Write(",");
}
Here is a simple Java class which 'cleans' a file name.
import java.util.Arrays;

public class FileNameCleaner {
    // Character codes that Windows rejects in file names; sorted below for binarySearch.
    final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

    static {
        Arrays.sort(illegalChars);
    }

    // Returns badFileName with every illegal character removed.
    public static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        for (int i = 0; i < badFileName.length(); i++) {
            int c = (int)badFileName.charAt(i);
            if (Arrays.binarySearch(illegalChars, c) < 0) {
                cleanName.append((char)c);
            }
        }
        return cleanName.toString();
    }
}
EDIT: As Stephen suggested you probably also should verify that these file accesses only occur within the directory you allow.
The following answer has sample code for establishing a custom security context in Java and then executing code in that 'sandbox'.
How do you create a secure JEXL (scripting) sandbox?
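For the directory check mentioned in the edit, one possible shape is to resolve the cleaned name against the allowed directory and confirm the result is still underneath it. This is only a sketch: DirectoryGuard and isInside are hypothetical names, and the check is purely lexical (it does not follow symbolic links, for which Path.toRealPath() would be needed on existing files).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper illustrating a "stay inside the allowed directory" check.
final class DirectoryGuard {
    // Returns true if resolving fileName against baseDir stays inside baseDir.
    static boolean isInside(Path baseDir, String fileName) {
        Path base = baseDir.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." segments before the comparison
        Path target = base.resolve(fileName).normalize();
        return target.startsWith(base);
    }

    public static void main(String[] args) {
        Path base = Paths.get("uploads");
        System.out.println(isInside(base, "report.txt"));       // true
        System.out.println(isInside(base, "../../etc/passwd")); // false
    }
}
```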
Solution 2
or just do this:
String filename = "A20/B22b#öA\\BC#Ä$%ld_ma.la.xps";
String sane = filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");
Result: A20_B22b_A_BC_ld_ma.la.xps
Explanation:
[a-zA-Z0-9\\._]
matches a letter from a-z lower or uppercase, numbers, dots and underscores
[^a-zA-Z0-9\\._]
is the inverse. i.e. all characters which do not match the first expression
[^a-zA-Z0-9\\._]+
is a sequence of characters which do not match the first expression
So every sequence of characters other than a-z, A-Z, 0-9, '.' and '_' is replaced by a single underscore.
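To see what this whitelist does to non-Latin names, here is a small comparison against the (?U) variant that a commenter suggests further down. WhitelistDemo and its method names are invented for this sketch; the two patterns are the ones quoted in this thread.

```java
// Hypothetical demo class; only the two regular expressions come from the thread.
final class WhitelistDemo {
    // ASCII-only whitelist from the answer above.
    static String asciiOnly(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS (Java 7+), so \w matches letters and digits in any script.
    static String unicodeAware(String filename) {
        return filename.replaceAll("(?U)[^\\w\\._]+", "_");
    }

    public static void main(String[] args) {
        String name = "\u6587\u4ef6\u540d?.txt"; // three Chinese characters followed by "?.txt"
        System.out.println(asciiOnly(name));    // _.txt
        System.out.println(unicodeAware(name)); // the three characters followed by _.txt
    }
}
```

With the ASCII whitelist the whole base name collapses to one underscore, which is the objection raised against whitelists elsewhere in this thread.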
Solution 3
This is based on the accepted answer by Sarel Botha which works fine as long as you don't encounter any characters outside of the Basic Multilingual Plane. If you need full Unicode support (and who doesn't?) use this code instead which is Unicode safe:
import java.util.Arrays;

public class FileNameCleaner {
    final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

    static {
        Arrays.sort(illegalChars);
    }

    public static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        // i is a char index, so it must advance by 1 or 2 chars per code point.
        for (int i = 0; i < badFileName.length(); ) {
            int c = badFileName.codePointAt(i);
            if (Arrays.binarySearch(illegalChars, c) < 0) {
                cleanName.appendCodePoint(c);
            }
            i += Character.charCount(c);
        }
        return cleanName.toString();
    }
}
Key changes here:
- Use codePointAt instead of charAt
- Advance the index by Character.charCount(c) instead of by 1, because codePointAt takes a char index and a code point outside the BMP occupies two chars
- Use appendCodePoint instead of append
- No need to cast chars to ints. In fact, you should avoid dealing with individual chars, as a single char cannot represent anything outside the BMP.
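One caveat worth checking for yourself: because every value in the illegal list is below 128, a char-at-a-time loop copies both halves of a surrogate pair untouched, so for this particular list the two approaches should produce the same output. The sketch below compares them. CodePointLoopSketch and its methods are hypothetical, and isIllegal is assumed to be equivalent to the 41-entry array (control characters 0-31 plus the nine punctuation characters).

```java
// Hypothetical comparison harness, not code from either answer.
final class CodePointLoopSketch {
    // Assumed equivalent of the illegalChars array: 0-31 plus " < > | : * ? \ /
    static boolean isIllegal(int codePoint) {
        return codePoint < 32 || "\"<>|:*?\\/".indexOf(codePoint) >= 0;
    }

    // Walks the string one code point at a time, advancing by 1 or 2 chars.
    static String cleanByCodePoint(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!isIllegal(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Walks the string one char at a time, as the accepted answer does.
    static String cleanByChar(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!isIllegal(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "a\uD834\uDD1E?b"; // 'a', U+1D11E as a surrogate pair, '?', 'b'
        System.out.println(cleanByChar(name).equals(cleanByCodePoint(name))); // true
    }
}
```

The code-point loop starts to matter once the filter makes decisions that depend on the whole code point, for example a reject list containing supplementary characters or a test based on Character.isLetter(int).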
Solution 4
Here is the code I use:
import org.apache.commons.lang3.SystemUtils;

public static String sanitizeName( String name ) {
    if( null == name ) {
        return "";
    }
    if( SystemUtils.IS_OS_LINUX ) {
        return name.replaceAll( "[\u0000/]+", "" ).trim();
    }
    return name.replaceAll( "[\u0000-\u001f<>:\"/\\\\|?*\u007f]+", "" ).trim();
}
SystemUtils is from Apache commons-lang3.
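Since the question asks for a solution without third-party libraries, here is a sketch of the same idea using only the JDK. NameSanitizerSketch is a hypothetical name, and guessing the platform from File.separatorChar is an assumption borrowed from a comment in this thread; unlike IS_OS_LINUX it also treats macOS and other '/'-separated systems as permissive.

```java
import java.io.File;

// Hypothetical JDK-only variant of the method above.
final class NameSanitizerSketch {
    // windowsRules = true strips everything Windows rejects; false strips only NUL and '/'.
    static String sanitizeName(String name, boolean windowsRules) {
        if (name == null) {
            return "";
        }
        String pattern = windowsRules
                ? "[\\x00-\\x1f<>:\"/\\\\|?*\\x7f]+"
                : "[\\x00/]+";
        return name.replaceAll(pattern, "").trim();
    }

    // Picks the rule set from the platform's path separator (an assumption, see above).
    static String sanitizeName(String name) {
        return sanitizeName(name, File.separatorChar != '/');
    }

    public static void main(String[] args) {
        System.out.println(sanitizeName("a<b>:c?.txt", true)); // abc.txt
        System.out.println(sanitizeName("a/b?c", false));      // ab?c
    }
}
```

Passing the rule set as a parameter keeps the method testable on any machine; the one-argument overload is what application code would normally call.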
Solution 5
There's a pretty good built-in Java solution - Character.isXxx().
Try Character.isJavaIdentifierPart(c):
String name = "name.é+!@#$%^&*(){}][/=?+-_\\|;:`~!'\",<>";
StringBuilder filename = new StringBuilder();
for (char c : name.toCharArray()) {
    if (c=='.' || Character.isJavaIdentifierPart(c)) {
        filename.append(c);
    }
}
Result is "name.é$_".
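One thing to watch with this approach: Character.isJavaIdentifierPart also returns true for "identifier-ignorable" characters, which include most ISO control characters such as NUL. A possible variant that filters those out and works on code points is sketched below; IdentifierFilterSketch and keepSafeChars are hypothetical names, and the extra isISOControl check is an addition, not part of the original answer.

```java
// Hypothetical variant of the loop above, using only java.lang.Character.
final class IdentifierFilterSketch {
    static String keepSafeChars(String name) {
        StringBuilder out = new StringBuilder();
        name.codePoints()
            // keep dots and identifier characters, but drop control characters,
            // which isJavaIdentifierPart would otherwise let through
            .filter(cp -> cp == '.'
                    || (Character.isJavaIdentifierPart(cp) && !Character.isISOControl(cp)))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepSafeChars("name.\u00e9+!@#$%^&*(){}][/=?+-_\\|;:`~!'\",<>"));
    }
}
```

String.codePoints() requires Java 8 or later.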
Updated on July 08, 2022
Comments
-
Ben S almost 2 years
I'm making a cross-platform application that renames files based on data retrieved online. I'd like to sanitize the Strings I took from a web API for the current platform.
I know that different platforms have different file-name requirements, so I was wondering if there's a cross-platform way to do this?
Edit: On Windows platforms you cannot have a question mark '?' in a file name, whereas in Linux, you can. The file names may contain such characters and I would like for the platforms that support those characters to keep them, but otherwise, strip them out.
Also, I would prefer a standard Java solution that doesn't require third-party libraries.
-
THelper over 12 years
Good Java example, but why didn't you include the forward slash (47)?
-
Sarel Botha over 12 years
No idea why it's not in the list. We actually just ran into this problem in production code. I've fixed the answer to include 47. Thanks.
-
Mark D over 11 years
Okay, so it's a conservative way and doesn't fully answer the original question (cross-platform), but it worked for me :)
-
Jaime Hablutzel about 11 years
It removes the hyphen, which is valid in file names (at least on Windows), but it does the job. Anyway, I think Apache Commons FilenameUtils should incorporate a cross-platform way to get this done.
-
Franz Kafka almost 11 years
The illegalChars array has to be sorted for binarySearch to work properly. Please add Arrays.sort(illegalChars) or change the array to "{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 42, 47, 58, 60, 62, 63, 92, 124}"
-
Sarel Botha over 10 years
This works on a file name that uses only English letters. If the file name is made up of only Chinese characters then you will strip everything out of it. We can't use whitelists on strings to strip bad characters for this reason; we have to use blacklists.
-
D-rk over 10 years
Have a look here: stackoverflow.com/questions/9576384/… it should work if you use Java 7
-
azerafati about 10 years
It also removes "@", which is again valid on Windows.
-
Stijn de Witt over 9 years
Your solution uses charAt()... Basically you should never use charAt. Consider it deprecated. The reason is that charAt cannot deal with Unicode code points outside of the Basic Multilingual Plane, as it returns a 16-bit value. Instead, use codePointAt(), which returns an integer. In addition, this removes the need for the cast to int that you are currently doing.
-
Stijn de Witt over 9 years
Keep in mind that length() returns the number of chars, so if you use codePointAt you need to use codePointCount(): badFileName.codePointCount(0, badFileName.length());
-
Stijn de Witt over 9 years
Mmm, you are also appending wrong... I'll post updated code with correct Unicode handling in a separate answer.
-
weaknespase over 9 years
You can use standard functions and work with chars; you just need to skip the character that follows the first character of a surrogate pair. Also, chars never need to be cast to numeric types; they are numeric by design.
-
Franz Kafka over 6 years
@Dirk Downvoted because regex is not the solution here. What if the filenames are in multiple languages?
-
D-rk over 6 years
It depends on the actual requirements. If whitelisting characters is sufficient, this solution is much more readable.
-
Tony BenBrahim over 6 years
Without SystemUtils: if( File.separatorChar=='/') { return name.replaceAll( "/+", "" ).trim(); }
-
Arie about 5 years
To preserve non-Latin characters in the filename, you can use the Unicode flag (since Java 1.7) as follows:
String sane = filename.replaceAll("(?U)[^\\w\\._]+", "_");
-
Laurent Grégoire about 5 years
Ouch. Clever, but don't use that if you require a fast solution (try/catch and recursion). Also, if you accept user input from the web, do not forget to trim the input; otherwise posting a 1 MB filename full of invalid chars would stack-overflow your server for sure ;)
-
Doddie over 4 years
I have read both the top answer and this one, and this one appears to be more carefully considered... however, I cannot find any case where this code performs correctly and the other one doesn't. What input demonstrates the difference?
-
ax. about 4 years
Is \u0000 allowed in filenames?
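On that last question: as far as I know, neither POSIX file systems nor Windows accept a NUL character in a file name, and the JDK's default file system rejects it when parsing a path. A quick way to check a candidate name on the current platform is sketched below; NulCheck and isAcceptedByJdk are hypothetical names, and this only tests path syntax, not whether the file can actually be created.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Hypothetical probe: asks the default file system whether a name parses as a path.
final class NulCheck {
    static boolean isAcceptedByJdk(String name) {
        try {
            Paths.get(name);
            return true;
        } catch (InvalidPathException e) {
            // thrown for characters the platform's path syntax does not allow
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAcceptedByJdk("report.txt"));
        System.out.println(isAcceptedByJdk("bad\0name.txt")); // contains a NUL character
    }
}
```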