UTF-8 filenames?

Solution 1

On Unix/Linux, a filename is a sequence of any bytes except for a slash or a NUL. A slash separates path components, and a NUL terminates a path name.
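As a rough illustration of that rule, here is a minimal Python sketch. The helper name `is_valid_component` is made up for this example, and it checks only the two forbidden bytes; it ignores length limits and the special names `.` and `..`.

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` could be a single path component on Unix/Linux.

    Hypothetical helper: it checks only the two forbidden bytes and that
    the name is not empty.
    """
    # "/" separates path components and NUL terminates the path,
    # so neither byte can appear inside a component.
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


# UTF-8 encoded text is just a sequence of ordinary bytes here.
print(is_valid_component("résumé.txt".encode("utf-8")))
# Bytes that are not valid UTF-8 pass as well: no encoding is enforced.
print(is_valid_component(b"\xff\xfe.bin"))
# A slash is rejected because it would split the name into two components.
print(is_valid_component(b"a/b"))
```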

So, you can use whatever encoding you want for filenames. Some applications may have trouble with some encodings if they are naïve about what characters may be in filenames - for example, poorly-written shell scripts often do not handle filenames with spaces.

Modern Unix/Linux environments handle UTF-8 encoded filenames just fine.

Solution 2

Internally, most filesystems store bytes: the filesystem driver doesn't care about what the bytes mean. The generic filesystem driver on Linux and most other modern unices allows any byte other than / and the null byte to appear in a file name.

Some filesystems do impose encoding constraints, usually non-native ones such as FAT or NTFS. Network file-sharing layers such as Samba may translate between the server encoding and the client encoding; you'll need to make sure that the server and client configurations are consistent.

Conventionally, on most systems, the bytes that make up a file name are interpreted as UTF-8. If you run an application that interprets file names as characters, for example one that transmits the names over FTP, you may need to configure it so that it knows your file names are encoded in UTF-8. Setting the environment variable LC_CTYPE to a UTF-8 locale such as en_US.UTF-8 does the trick for many command-line applications.
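To make "interpreting the bytes as UTF-8" concrete, here is a small Python sketch. `display_name` is a hypothetical helper, not something from the answer. It decodes a raw name as UTF-8 and substitutes U+FFFD for any bytes that are not valid UTF-8. `sys.getfilesystemencoding()` reports the encoding Python itself uses for file names, which typically follows the locale.

```python
import sys


def display_name(raw: bytes) -> str:
    """Decode raw file name bytes as UTF-8 for display (hypothetical helper)."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


# The encoding Python uses for file names in the current environment.
print(sys.getfilesystemencoding())

good = display_name("naïve.txt".encode("utf-8"))
# 0xE9 is "é" in Latin-1, but it is not a complete UTF-8 sequence by itself.
bad = display_name(b"caf\xe9.txt")

print(good == "naïve.txt")
print("\ufffd" in bad)
```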

If you store files on a system that doesn't support UTF-8, it doesn't matter. The bytes will remain the same. You won't be able to display the characters that make up the file names, but if you copy the files back to a system that supports UTF-8, those same bytes will still display as UTF-8 characters.
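The round trip can be sketched in Python. This assumes, purely for illustration, that the other system displays names as Latin-1; the function name `view_as_latin1` is made up.

```python
def view_as_latin1(raw: bytes) -> str:
    """Show how a Latin-1 system would display the raw name bytes."""
    # Latin-1 maps every byte value to a character, so this never fails.
    return raw.decode("latin-1")


name = "日本語.txt"
raw = name.encode("utf-8")          # the bytes stored on disk

garbled = view_as_latin1(raw)       # unreadable on the other system
restored = garbled.encode("latin-1")

# The displayed text differs, but the underlying bytes are unchanged,
# so a UTF-8 system shows the original name again.
print(garbled != name)
print(restored == raw)
print(restored.decode("utf-8") == name)
```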

If you're writing your own application, using UTF-8 internally and, whenever possible, for storage and transmission is a good idea.
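For the case in the question, where a name derived from metadata is written to disk, a minimal Python sketch might look like the following. The helpers `filename_from_metadata` and `write_file` and the sample title are assumptions for illustration. The only special handling is replacing the slash and dropping NUL, since those are the two bytes a file name cannot contain.

```python
import os
import tempfile


def filename_from_metadata(title: str) -> bytes:
    """Build a UTF-8 encoded file name from a metadata string (hypothetical)."""
    # Replace the path separator and drop NUL; everything else is allowed.
    cleaned = title.replace("/", "_").replace("\x00", "")
    return (cleaned or "unnamed").encode("utf-8")


def write_file(directory: str, title: str, data: bytes) -> bytes:
    """Write `data` under a name derived from `title`; return the name bytes."""
    name = filename_from_metadata(title)
    # Use byte paths throughout so no implicit encoding step is involved.
    path = os.path.join(os.fsencode(directory), name)
    with open(path, "wb") as fh:
        fh.write(data)
    return name


with tempfile.TemporaryDirectory() as tmp:
    written = write_file(tmp, "отчет 2022/Q3", b"hello")
    # Listing with a bytes path returns the stored names as raw bytes.
    print(written in os.listdir(os.fsencode(tmp)))
```

For the FTP step, Python's `ftplib.FTP` accepts an `encoding` argument (UTF-8 by default since Python 3.9); whether the remote server treats the name as UTF-8 is a separate matter.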

Author by

Mark D

Updated on September 18, 2022

Comments

  • Mark D
    Mark D over 1 year

    Are UTF-8 filenames permissible on Unix-based operating systems? If so, do I need to do anything special to write the file to disk?

    Let me explain what I'm hoping to do. I'm writing an application that will transfer a file via FTP to a remote system, but the filename is set dynamically from metadata that could potentially be in UTF-8. I'm wondering whether there's something I need to do to write the file to disk on Unix/Linux.

    Also, as a follow-up, does anyone know what would happen if I uploaded a UTF-8 filename to a system that doesn't support UTF-8?