Is there a cross-platform Java method to remove filename special chars?


Solution 1

As suggested elsewhere, sanitizing an arbitrary name is usually not what you want to do. It is generally better to let the JDK create the file under a safe, unique name using a secure method such as File.createTempFile().
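A minimal sketch of that approach, assuming the file only needs some unique name and the original name can be stored elsewhere. The class name, the method name, and the "download-" prefix are made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TempFileSketch {
    // Lets the JDK choose a unique, valid name in the default temp directory,
    // so no untrusted text ever becomes part of the file name.
    static File newDownloadFile() {
        try {
            // The prefix must be at least three characters long.
            File f = File.createTempFile("download-", ".dat");
            f.deleteOnExit(); // best-effort cleanup when the JVM exits
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

File.createTempFile(prefix, suffix, directory) does the same in a directory of your choice.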

You should not do this with a whitelist that keeps only 'good' characters. If the file name is made up of only Chinese characters, a whitelist strips everything out of it. For this reason we have to use a blacklist.

Linux allows almost anything in a file name (only the NUL character and '/' are forbidden), which can be a real pain. I would just limit Linux to the same list that you limit Windows to, so you save yourself headaches in the future.

Using this C# snippet on Windows I produced a list of characters that are not valid on Windows. There are quite a few more characters in this list than you may think (41) so I wouldn't recommend trying to create your own list.

        foreach (char c in new string(Path.GetInvalidFileNameChars()))
        {
            Console.Write((int)c);
            Console.Write(",");
        }
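If you would rather let the current platform decide, one option on Java 7 and later is to ask the default file system whether it accepts the name. The sketch below is only an approximation: the class and method names are invented here, and it checks only what Paths.get rejects syntactically, so names like "." still pass.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

final class FileNameProbe {
    // Returns true if the default file system parses the text as a single
    // path element, i.e. without a root and without directory separators.
    static boolean isValidFileName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            Path p = Paths.get(name);
            return p.getRoot() == null && p.getNameCount() == 1;
        } catch (InvalidPathException e) {
            // The platform rejected at least one character in the name.
            return false;
        }
    }
}
```

On Windows a name such as "a?b" should be rejected here, while on Linux it should be accepted, which matches the per-platform behaviour the question asks for.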

Here is a simple Java class which 'cleans' a file name.

import java.util.Arrays;

public class FileNameCleaner {
    // Character codes reported as invalid by Path.GetInvalidFileNameChars() on Windows.
    static final int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

    static {
        // Arrays.binarySearch requires a sorted array.
        Arrays.sort(illegalChars);
    }

    public static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        for (int i = 0; i < badFileName.length(); i++) {
            int c = badFileName.charAt(i);
            // Keep the character only if it is not in the illegal list.
            if (Arrays.binarySearch(illegalChars, c) < 0) {
                cleanName.append((char) c);
            }
        }
        return cleanName.toString();
    }
}

EDIT: As Stephen suggested, you should probably also verify that these file accesses occur only within the directory you allow.

The following answer has sample code for establishing a custom security context in Java and then executing code in that 'sandbox'.

How do you create a secure JEXL (scripting) sandbox?
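Short of a full sandbox, the directory check mentioned in the edit can be sketched roughly as follows. This is an illustration under assumptions, not a vetted security control: the class and method names are hypothetical, and it relies on canonical paths to resolve ".." segments and symbolic links.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DirectoryGuard {
    // Resolves fileName against baseDir and refuses any result that ends up
    // outside baseDir (for example via "../" segments).
    static File resolveInside(File baseDir, String fileName) {
        try {
            File base = baseDir.getCanonicalFile();
            File target = new File(base, fileName).getCanonicalFile();
            // Path.startsWith compares whole name elements, so "/data-evil"
            // is not treated as being inside "/data".
            if (target.equals(base) || !target.toPath().startsWith(base.toPath())) {
                throw new SecurityException("Path escapes base directory: " + fileName);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running the cleaned name through a check like this before opening the file covers the case where the sanitizer and the file system disagree about what a separator is.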

Solution 2

or just do this:

String filename = "A20/B22b#öA\\BC#Ä$%ld_ma.la.xps";
String sane = filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");

Result: A20_B22b_A_BC_ld_ma.la.xps

Explanation:

[a-zA-Z0-9\\._] matches a letter from a-z lower or uppercase, numbers, dots and underscores

[^a-zA-Z0-9\\._] is the inverse. i.e. all characters which do not match the first expression

[^a-zA-Z0-9\\._]+ is a sequence of characters which do not match the first expression

So every sequence of characters that are not a-z, A-Z, 0-9, a dot or an underscore is replaced with a single underscore.
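As pointed out in the comments, this whitelist strips every non-Latin letter. A possible variant, sketched here with invented class and method names, turns on the regex Unicode flag (Pattern.UNICODE_CHARACTER_CLASS, the same as an embedded (?U), available since Java 7) so that \w also matches letters and digits from other scripts:

```java
import java.util.regex.Pattern;

final class RegexSanitizer {
    // ASCII-only whitelist, equivalent to the one-liner above.
    private static final Pattern ASCII = Pattern.compile("[^a-zA-Z0-9._]+");
    // With UNICODE_CHARACTER_CLASS, \w covers Unicode letters and digits too.
    private static final Pattern UNICODE =
            Pattern.compile("[^\\w._]+", Pattern.UNICODE_CHARACTER_CLASS);

    static String asciiOnly(String name) {
        return ASCII.matcher(name).replaceAll("_");
    }

    static String unicodeAware(String name) {
        return UNICODE.matcher(name).replaceAll("_");
    }
}
```

For a name consisting of three Chinese characters followed by ".txt", asciiOnly returns "_.txt" while unicodeAware returns the name unchanged.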

Solution 3

This is based on the accepted answer by Sarel Botha which works fine as long as you don't encounter any characters outside of the Basic Multilingual Plane. If you need full Unicode support (and who doesn't?) use this code instead which is Unicode safe:

import java.util.Arrays;

public class FileNameCleaner {
  final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

  static {
    Arrays.sort(illegalChars);
  }

  public static String cleanFileName(String badFileName) {
    StringBuilder cleanName = new StringBuilder();
    // i is a char index, so advance it by the number of chars in each code point.
    for (int i = 0; i < badFileName.length(); ) {
      int c = badFileName.codePointAt(i);
      if (Arrays.binarySearch(illegalChars, c) < 0) {
        cleanName.appendCodePoint(c);
      }
      i += Character.charCount(c);
    }
    return cleanName.toString();
  }
}

Key changes here:

  • Advance the index by Character.charCount(c) instead of by 1, because codePointAt takes a char index and a code point outside the BMP occupies two chars
  • Use codePointAt instead of charAt
  • Use appendCodePoint instead of append
  • No need to cast chars to ints. In fact, you should never deal with chars as they are basically broken for anything outside the BMP.
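On Java 8 and later, the index bookkeeping can be avoided altogether with String.codePoints(). The following is a sketch of that alternative rather than part of the original answer; the class name is invented, and the array is the pre-sorted list given in the comments:

```java
import java.util.Arrays;

final class CodePointCleaner {
    // Same 41 values as above, already sorted so binarySearch works directly.
    private static final int[] ILLEGAL = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
        21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 42, 47, 58, 60, 62, 63, 92, 124
    };

    static String clean(String name) {
        StringBuilder out = new StringBuilder(name.length());
        name.codePoints()                                        // one int per code point
            .filter(cp -> Arrays.binarySearch(ILLEGAL, cp) < 0)  // drop illegal ones
            .forEach(out::appendCodePoint);                      // re-encode as chars
        return out.toString();
    }
}
```

Characters outside the BMP, such as emoji, pass through intact because each surrogate pair is read and written as a single code point.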

Solution 4

Here is the code I use:

public static String sanitizeName( String name ) {
    if( null == name ) {
        return "";
    }

    if( SystemUtils.IS_OS_LINUX ) {
        return name.replaceAll( "[\u0000/]+", "" ).trim();
    }

    return name.replaceAll( "[\u0000-\u001f<>:\"/\\\\|?*\u007f]+", "" ).trim();
}

SystemUtils comes from Apache Commons Lang 3 (org.apache.commons.lang3.SystemUtils).
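If a third-party dependency is unwanted, the same idea can be sketched with the JDK alone. Note the assumptions: the class name and the two-argument overload are invented for illustration, and File.separatorChar == '/' is used as a rough stand-in for the Linux check, so it also applies the lenient rules on other Unix-like systems such as macOS.

```java
import java.io.File;
import java.util.regex.Pattern;

final class PlainSanitizer {
    // Unix-like file systems forbid only NUL and '/'.
    private static final Pattern UNIX_ILLEGAL = Pattern.compile("[\\x00/]+");
    // Control characters, DEL, and the characters Windows reserves.
    private static final Pattern WINDOWS_ILLEGAL =
            Pattern.compile("[\\x00-\\x1f<>:\"/\\\\|?*\\x7f]+");

    // Picks the rule set from the platform's separator character.
    static String sanitizeName(String name) {
        return sanitizeName(name, File.separatorChar == '/');
    }

    // Explicit variant, convenient for testing both rule sets on one machine.
    static String sanitizeName(String name, boolean unixRules) {
        if (name == null) {
            return "";
        }
        Pattern illegal = unixRules ? UNIX_ILLEGAL : WINDOWS_ILLEGAL;
        return illegal.matcher(name).replaceAll("").trim();
    }
}
```

Compiling the patterns once avoids rebuilding the regex on every call, which replaceAll on a String would otherwise do.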

Solution 5

There's a pretty good built-in Java solution - Character.isXxx().

Try Character.isJavaIdentifierPart(c):

String name = "name.é+!@#$%^&*(){}][/=?+-_\\|;:`~!'\",<>";
StringBuilder filename = new StringBuilder();

for (char c : name.toCharArray()) {
  if (c=='.' || Character.isJavaIdentifierPart(c)) {
    filename.append(c);
  }
}

Result is "name.é$_".
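One caveat: Character.isJavaIdentifierPart also returns true for "identifier-ignorable" characters, which include control characters such as \u0000. A slightly stricter sketch, with an invented class name, excludes those and works on code points rather than chars:

```java
final class IdentifierFilter {
    // Keeps '.' and Java identifier characters, but drops the
    // identifier-ignorable ones (control and format characters).
    static String filter(String name) {
        StringBuilder out = new StringBuilder(name.length());
        name.codePoints()
            .filter(cp -> cp == '.'
                    || (Character.isJavaIdentifierPart(cp)
                        && !Character.isIdentifierIgnorable(cp)))
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

For the sample string above it yields the same result as the original loop; the difference only shows up for input containing control or format characters.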

Author by

Ben S

Mobile Software Engineering Manager at Square currently working on Cash App iOS with a University of Waterloo bachelor's degree in Computer Science, Software Engineering Option. Previous experience at Google, Amazon.com, OpenText, Research In Motion, Sybase and Bridgewater Systems.

Updated on July 08, 2022

Comments

  • Ben S
    Ben S almost 2 years

    I'm making a cross-platform application that renames files based on data retrieved online. I'd like to sanitize the Strings I took from a web API for the current platform.

    I know that different platforms have different file-name requirements, so I was wondering if there's a cross-platform way to do this?

    Edit: On Windows platforms you cannot have a question mark '?' in a file name, whereas in Linux, you can. The file names may contain such characters and I would like for the platforms that support those characters to keep them, but otherwise, strip them out.

    Also, I would prefer a standard Java solution that doesn't require third-party libraries.

  • THelper
    THelper over 12 years
    Good java example, but why didn't you include the forward slash (47)?
  • Sarel Botha
    Sarel Botha over 12 years
    No idea why it's not in the list. We actually just ran into this problem in production code. I've fixed the answer to include 47. Thanks.
  • Mark D
    Mark D over 11 years
    okay, so it's a conservative way and doesn't meet the original question fully (cross-platform), but worked for me :)
  • Jaime Hablutzel
    Jaime Hablutzel about 11 years
    It removes the hyphen, which is valid in filenames (at least on Windows), but it does the job. Anyway, I think Apache Commons FilenameUtils should incorporate a cross-platform way to get this done.
  • Franz Kafka
    Franz Kafka almost 11 years
    The illegalChars array has to be sorted for binarySearch to work properly. Please add Arrays.sort(illegalChars) or change the array to "{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 42, 47, 58, 60, 62, 63, 92, 124}"
  • Sarel Botha
    Sarel Botha over 10 years
    This works on a file name that uses only English letters. If the file is made up of only Chinese characters then you will strip everything out of it. We can't use whitelists on strings to strip bad characters for this reason, we have to use blacklists.
  • D-rk
    D-rk over 10 years
    Have a look here: stackoverflow.com/questions/9576384/… it should work if you use Java 7
  • azerafati
    azerafati about 10 years
    also it removes "@" too which is again valid in Windows.
  • Stijn de Witt
    Stijn de Witt over 9 years
    Your solution uses charAt()... Basically you should never use charAt. Consider it as deprecated. Reason is that charAt cannot deal with Unicode code points outside of the Basic Multilingual Plane as it's a 16-bit value. Instead, use codePointAt() which returns an integer. In addition this removes the need for the cast to int that you are currently doing.
  • Stijn de Witt
    Stijn de Witt over 9 years
    Keep in mind that length() returns the number of chars so if you use codePointAt you need to use codePointCount(): badFileName.codePointCount(0, badFileName.length());
  • Stijn de Witt
    Stijn de Witt over 9 years
    Mmm you are also appending wrong... I'll post updated code with correct Unicode handling in a separate answer.
  • weaknespase
    weaknespase over 9 years
    You can use standard functions and work with chars - you just need to skip the char that follows a high-surrogate char. Also, chars never need to be cast to numeric types - they are numeric by design.
  • Franz Kafka
    Franz Kafka over 6 years
    @Dirk Downvoted because regex is not the solution here. What if the filenames are in multiple languages?
  • D-rk
    D-rk over 6 years
    It depends on the actual requirements. If whitelisting characters is sufficient, this solution is much more readable.
  • Tony BenBrahim
    Tony BenBrahim over 6 years
    without SystemUtils: if( File.separatorChar=='/') { return name.replaceAll( "/+", "" ).trim(); }
  • Arie
    Arie about 5 years
    To preserve non-latin characters in the filename, you can use the unicode flag (since Java 1.7) as follows: String sane = filename.replaceAll("(?U)[^\\w\\._]+", "_") ;
  • Laurent Grégoire
    Laurent Grégoire about 5 years
    Ouch. Clever, but don't use that if you require a fast solution (try/catch and recursion). Also if you accept user input from the web, do not forget to trim the input; otherwise posting a filename 1Mb long full of invalid chars would stack-overflow your server for sure ;)
  • Doddie
    Doddie over 4 years
    I have read both the top answer and this one, and this one appears to be more carefully considered...however I cannot find any case where this code performs correctly and the other one doesn't. What input demonstrates the difference?
  • ax.
    ax. about 4 years
    Is \u0000 allowed in filenames?