Create (sane/safe) filename from any (unsafe) string


Solution 1

Python:

"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()

This accepts Unicode letters and digits but removes line breaks, punctuation, etc.

example:

filename = u"ad\nbla'{-+\\)(ç?"

gives: adblaç

Edit: str.isalnum() does the alphanumeric check in one step (from a comment by queueoverflow). danodonovan suggested also keeping the dot:

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()

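As a quick check, the one-liner above can be wrapped in a function. The name `sanitize` is made up for this sketch and is not part of the original answer. Note that because '.' is kept, a string such as ".." passes through unchanged, so this is not a complete defence on its own.

```python
def sanitize(filename, keepcharacters=(" ", ".", "_")):
    """Keep alphanumerics (Unicode-aware) plus a few explicitly allowed characters."""
    kept = "".join(c for c in filename if c.isalnum() or c in keepcharacters)
    # rstrip() with no arguments only removes trailing whitespace.
    return kept.rstrip()


print(sanitize("ad\nbla'{-+\\)(ç?"))     # adblaç
print(sanitize("my report (v2).txt "))  # my report v2.txt
```
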
Solution 2

My requirements were conservative (the generated filenames needed to be valid on multiple operating systems, including some ancient mobile OSes). I ended up with:

    "".join([c for c in text if re.match(r'\w', c)])

On Python 2, that whitelists the ASCII alphanumeric characters (a-z, A-Z, 0-9) and the underscore. On Python 3, \w also matches Unicode word characters in str patterns unless the re.ASCII flag is passed. The regular expression can be compiled and cached for efficiency if there are a lot of strings to be matched. In my case, it wouldn't have made any significant difference.
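A minimal sketch of the compiled variant, assuming Python 3. It is a variation on the answer rather than the author's code: it removes non-word characters with one `sub` call instead of matching character by character, and the names `_NOT_WORD` and `strip_non_word` are hypothetical.

```python
import re

# Compiled once at import time; re.ASCII restricts \w (and \W) to [a-zA-Z0-9_].
_NOT_WORD = re.compile(r"\W", re.ASCII)


def strip_non_word(text):
    """Drop every character that is not an ASCII letter, digit, or underscore."""
    return _NOT_WORD.sub("", text)


print(strip_non_word("ad\nbla'{-+\\)(ç1?"))  # adbla1
```

Without `re.ASCII`, the same pattern would keep the `ç` on Python 3.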

Solution 3

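Much like the regex approaches already mentioned, but inverted: replace every character that is NOT in the allowed set. (The session shown is Python 2; on Python 3, \w also matches ç unless re.ASCII is passed, so ç would be kept.)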

>>> import re
>>> filename = u"ad\nbla'{-+\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'

Solution 4

There are a few reasonable answers here, but in my case I wanted to take a string that might contain spaces and punctuation and, rather than just removing those characters, replace each of them with an underscore. Even though spaces are allowed in filenames on most OSes, they are problematic. Also, if the original string contained a period, I didn't want it to pass through into the filename, or it would generate "extra extensions" that I might not want (I'm appending the extension myself).

def make_safe_filename(s):
    def safe_char(c):
        if c.isalnum():
            return c
        else:
            return "_"
    return "".join(safe_char(c) for c in s).rstrip("_")

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : die!!!" ) + ".gif")

prints:

hello_you_crazy_______2579_people______die.gif

Solution 5

No solutions here, only problems that you must consider:

  • what is the smallest maximum filename length you need to support? (e.g. DOS allows only 8.3 names, i.e. 8 characters plus a 3-character extension; most filesystems don't support names longer than about 255 characters)

  • what filenames are forbidden in some context? (Windows still doesn't support saving a file as CON.TXT -- see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)

  • remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.

  • is there a risk that filenames will collide -- either due to removal of characters or the same filename being used multiple times?
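
The points above can be turned into a small checker. This is only a sketch under stated assumptions: the function name `filename_problems`, the 255-byte limit, and the case-insensitive collision rule are choices made for illustration, not requirements of any particular OS.

```python
# Device names that Windows reserves regardless of extension.
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
MAX_BYTES = 255  # assumption: a common per-name limit; check your own targets


def filename_problems(name, existing=()):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if name in ("", ".", ".."):
        problems.append("empty or dot name")
    # "CON.TXT" is still the CON device, so compare only the part before the first dot.
    if name.split(".")[0].upper() in WINDOWS_RESERVED:
        problems.append("reserved on Windows")
    if len(name.encode("utf-8")) > MAX_BYTES:
        problems.append("too long")
    # Assumption: treat names that differ only by case as colliding.
    if name.casefold() in {other.casefold() for other in existing}:
        problems.append("collision")
    return problems


print(filename_problems("CON.TXT"))                     # ['reserved on Windows']
print(filename_problems("Report.txt", ["report.txt"]))  # ['collision']
```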

Consider just hashing the data and using the hex digest of that as the filename?
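
A minimal sketch of the hashing idea, assuming SHA-256 from the standard library is acceptable; the function name `hashed_filename` and its `extension` parameter are hypothetical. The result contains only the characters 0-9 and a-f, has a fixed length, and collides only if the input bytes are identical (barring hash collisions), at the cost of being unreadable.

```python
import hashlib


def hashed_filename(data, extension=""):
    """Name a file after the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(data).hexdigest() + extension


name = hashed_filename("ad\nbla'{-+\\)(ç?".encode("utf-8"), ".gif")
print(len(name))  # 68: 64 hex characters plus ".gif"
```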

Author by

Albert

I am a postgraduate of RWTH Aachen, Germany, and received an M.S. in Math and an M.S. in CompSci. My main interests are Machine Learning, Neural Networks, Artificial Intelligence, Logic, Automata Theory and Programming Languages. I'm also an enthusiastic hobby programmer with a wide range of side projects, mostly in C++ and Python.

Updated on January 10, 2022

Comments

  • Albert, over 2 years ago

    I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).

    (It doesn't matter for me whether the function is Cocoa, ObjC, Python, etc.)


    Of course, there might be infinitely many characters that count as strange. Thus, it is not really a solution to have a blacklist and keep adding to it over time.

    I could have a whitelist. However, I don't really know how to define it. [a-zA-Z0-9 .] is a start, but I also want to accept Unicode characters which can be displayed in a normal way.
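
One way to define such a whitelist without listing characters is by Unicode general category. This is a sketch, not a complete answer: the function name `keep_readable` is hypothetical, and treating "letters and numbers" (categories starting with L or N) as the displayable set is an assumption.

```python
import unicodedata


def keep_readable(text, extra=" ."):
    """Keep Unicode letters and numbers plus the characters listed in `extra`."""
    # NFC first, so that a base letter followed by a combining mark becomes one letter.
    normalized = unicodedata.normalize("NFC", text)
    kept = "".join(
        c for c in normalized
        if unicodedata.category(c)[0] in ("L", "N") or c in extra
    )
    return kept.strip()


print(keep_readable("ad\nbla'{-+\\)(ç?"))  # adblaç
print(keep_readable("日本語 file.txt"))     # 日本語 file.txt
```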