identify files with non-ASCII or non-printable characters in file name

29,693

Solution 1

Assuming that "foreign" means "not an ASCII character", you can use find with a pattern to list all files whose names contain at least one character outside the printable ASCII range:

LC_ALL=C find . -name '*[! -~]*'

(The space is the first printable character listed on http://www.asciitable.com/, ~ is the last.)

Setting LC_ALL=C is required (strictly, LC_CTYPE=C and LC_COLLATE=C are what matter); otherwise the character range is interpreted according to the locale's collation order and matches the wrong set of characters. See also the manual page glob(7). Because LC_ALL=C makes find treat names as ASCII bytes, it prints each byte of a multi-byte character (such as π) as a question mark when writing to a terminal. To see the real names, pipe the output to another program (e.g. cat) or redirect it to a file.

Instead of specifying a character range, the class [:print:] can be used to select printable characters, as in '*[![:print:]]*'. Be sure to set the C locale, or the results will seem quite arbitrary.

Example:

$ touch $(printf '\u03c0') "$(printf 'x\ty')"
$ ls -F
dir/  foo  foo.c  xrestop-0.4/  xrestop-0.4.tar.gz  π
$ find -name '*[! -~]*'       # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
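
To check the pattern against known inputs, a scratch directory can be populated with a few test names. The following is a minimal sketch, assuming a POSIX sh with mktemp available. The file names and the scratch and matches variables are made up for illustration, and octal escapes are used because not every printf understands \u.

```shell
#!/bin/sh
# Hypothetical test setup: all names below are chosen for illustration only.
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1

touch plain.txt                      # printable ASCII only
touch "$(printf 'pi-\317\200')"      # UTF-8 bytes of U+03C0 (π)
touch "$(printf 'x\ty')"             # contains a TAB, which is not printable

# Byte-wise match in the C locale. Capturing the output in a variable means
# find is not writing to a terminal, so names are not replaced by '?'.
matches=$(LC_ALL=C find . -name '*[! -~]*' | LC_ALL=C sort)
printf '%s\n' "$matches"

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```

Only the two names containing the π bytes and the TAB should be listed; plain.txt should not appear.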

Solution 2

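If you translate each file name using tr -d '\200-\377' and compare the result with the original name, any file name that contains bytes outside 7-bit ASCII will differ. (Do not wrap the range in brackets as in '[\200-\377]': most tr implementations treat them as literal characters and delete [ and ] as well.)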

(This assumes that by "foreign" you mean non-ASCII.)
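
The comparison described in this solution might be sketched as follows. This is only an illustration, assuming a POSIX sh; has_high_bytes is a hypothetical helper name, and the check covers bytes 0x80-0xFF only, so control characters such as TAB are not detected.

```shell
#!/bin/sh
# has_high_bytes NAME: hypothetical helper, succeeds if NAME contains any
# byte in the range 0x80-0xFF (anything outside 7-bit ASCII).
has_high_bytes() {
    # The trailing "x" is a sentinel: command substitution strips trailing
    # newlines, which would otherwise make names ending in a newline differ.
    stripped=$(printf '%sx' "$1" | LC_ALL=C tr -d '\200-\377')
    [ "$stripped" != "${1}x" ]
}

# Report matching names in the current directory.
for name in *; do
    if has_high_bytes "$name"; then
        printf '%s\n' "$name"
    fi
done
```

The range is passed to tr without surrounding brackets, so names containing [ or ] are not reported by mistake.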

Solution 3

You can use tr to delete any foreign character from a filename and compare the result with the original filename to see if it contained foreign characters.

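find . -type f > filenames
# IFS= and -r keep leading blanks and backslashes in each name intact.
while IFS= read -r filename; do
      # In the C locale, keep only ASCII letters, digits, whitespace and punctuation.
      stripped="$(printf '%s\n' "$filename" | LC_ALL=C tr -d -C '[:alnum:][:space:][:punct:]')"
      test "$filename" = "$stripped" || printf '%s\n' "$filename"
done < filenames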
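
Because file names may contain newlines, reading a line-based list can split a name in two. One possible way around this is to let find hand the names directly to a small sh script through -exec, which avoids parsing find's output altogether. This is a sketch under that assumption, not the answer's own method, and list_odd_names is a hypothetical helper name.

```shell
#!/bin/sh
# list_odd_names DIR: hypothetical helper that prints every regular file
# under DIR whose path contains a byte that is not an ASCII letter, digit,
# whitespace or punctuation character.
list_odd_names() {
    find "$1" -type f -exec sh -c '
        for path do
            # The trailing "x" protects newlines at the end of the path
            # from being stripped by command substitution.
            stripped=$(printf "%sx" "$path" | LC_ALL=C tr -d -C "[:alnum:][:space:][:punct:]")
            [ "$stripped" = "${path}x" ] || printf "%s\n" "$path"
        done
    ' sh {} +
}

list_odd_names .
```

The names are still printed one per line for reading. If the output is to be processed further, printing them with a trailing NUL instead (and reading them with a NUL-aware tool) would be the safer choice.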

Solution 4

The accepted answer is helpful, but if your filenames are already in the encoding specified in LANG/LC_CTYPE, it's better to just do:

LC_COLLATE=C find . -name '*[! -~]*'

Character classes are affected by LC_CTYPE, but the above command uses only a range, not character classes, so leaving LC_CTYPE at its usual value just prevents the unusual characters from being replaced by question marks.

Author by

suspectus

Updated on September 18, 2022

Comments

  • suspectus
    suspectus almost 2 years

    In an 80 GB directory with approximately 700,000 files, some file names contain non-English characters. Other than laboriously trawling through the file list, is there:

    • An easy way to list or otherwise identify these file names?
    • A way to generate printable non-English language characters - those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?
  • Timo
    Timo over 10 years
    that is a nice extension to my answer, but it is too simple, file names can have newlines in them and then your script will not work
  • Lekensteyn
    Lekensteyn over 10 years
    If you want to post-process find output, use NUL-terminated output/input as shown in this answer.
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    That also removes [ and ] in most tr implementations.
  • Lekensteyn
    Lekensteyn over 10 years
    Be aware that you may have file names that use foreign character sets incompatible with UTF-8 or ASCII. In those cases, you may see question marks instead of characters.
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    +1, but I would use LC_ALL=C instead of LC_COLLATE=C, as it doesn't make much sense to set LC_COLLATE to C without setting LC_CTYPE, and LC_ALL=C still works even when the LC_ALL variable is already set in the environment.
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    If SPC is printable, then what about TAB and LF which are also typically found in text files?
  • suspectus
    suspectus over 10 years
    Yes - it did remove [ and ] on my system.
  • suspectus
    suspectus over 10 years
    +1 - the solution did find all the (six) file names with non ASCII symbols (in addition to the [ and ]s). thanks.
  • suspectus
    suspectus over 10 years
    Thanks - this found six files, which had long hyphen, short hyphen and a variant of single quote. These had all originated from MS Word. No difference in the files listed between LC_ALL and LC_COLLATE. LC_COLLATE displayed the non-ASCII chars correctly whereas LC_ALL displayed ??? instead. Excellent answer!
  • Lekensteyn
    Lekensteyn over 10 years
    @suspectus I updated my answer based on suggestions from Stéphane. For LC_COLLATE and LC_CTYPE, see also the find(1) manpage.