identify files with non-ASCII or non-printable characters in file name
Solution 1
Assuming that "foreign" means "not an ASCII character", then you can use find
with a pattern to find all files not having printable ASCII characters in their names:
LC_ALL=C find . -name '*[! -~]*'
(The space is the first printable character listed on http://www.asciitable.com/, and ~ is the last.)
Setting LC_ALL=C is required (strictly speaking, LC_CTYPE=C and LC_COLLATE=C); otherwise the character range is interpreted incorrectly. See also the manual page glob(7). Since LC_ALL=C causes find to interpret strings as ASCII, it will print multi-byte characters (such as π) as question marks. To fix this, pipe the output to some program (e.g. cat) or redirect it to a file.
Instead of specifying character ranges, [:print:] can also be used to select "printable characters". Be sure to set the C locale, or you get (seemingly) quite arbitrary behavior.
Example:
$ touch $(printf '\u03c0') "$(printf 'x\ty')"
$ ls -F
dir/ foo foo.c xrestop-0.4/ xrestop-0.4.tar.gz π
$ find -name '*[! -~]*' # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
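A minimal sketch of how the command above could be tested and post-processed safely. The function names make_test_names and count_non_ascii_names are hypothetical, and -print0 is a GNU/BSD find extension rather than something every find provides. The first helper creates sample names (including one with a newline); the second counts matches by counting the NUL terminators, so a newline inside a name cannot inflate the count.

```shell
#!/bin/sh
# Hypothetical helper: create sample files in directory $1 for testing.
make_test_names() {
    : > "$1/plain.txt"                  # printable ASCII only
    : > "$1/$(printf 'pi-\317\200')"    # UTF-8 bytes of U+03C0 (π)
    : > "$1/$(printf 'x\ty')"           # contains a TAB
    : > "$1/$(printf 'a\nb')"           # contains a newline
}

# Hypothetical helper: count names under $1 that contain at least one byte
# outside the printable ASCII range (space through ~).
# Assumes a find that supports -print0 (GNU and BSD find do).
count_non_ascii_names() {
    LC_ALL=C find "$1" -name '*[! -~]*' -print0 |
        tr -dc '\000' |     # keep only the NUL terminators
        wc -c |             # one NUL per matching name
        tr -d ' '           # some wc implementations pad with spaces
}
```

With the four sample files, three names (π, TAB and newline) fall outside the printable range, while plain.txt does not.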
Solution 2
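If you translate each file name using tr -d '[\200-\377]' and compare it with the original name, then any file name that contains special characters will not be the same.
(This assumes that "foreign" means non-ASCII. Note that most tr implementations treat the square brackets literally, so [ and ] are removed as well; tr -d '\200-\377' avoids that.)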
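The comparison could be sketched for a single name as follows. The function name has_non_ascii is hypothetical, and the set is written without square brackets on the assumption that the local tr follows POSIX syntax for octal ranges. A sentinel character is appended because command substitution would otherwise strip trailing newlines from the name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if $1 contains any byte
# in the range 0x80-0xFF, i.e. any non-ASCII byte.
has_non_ascii() {
    # Delete all high bytes; the trailing "x" protects trailing newlines.
    stripped=$(printf '%sx' "$1" | LC_ALL=C tr -d '\200-\377')
    # If deleting changed the name, it contained non-ASCII bytes.
    [ "${stripped%x}" != "$1" ]
}
```

A caller would loop over the names of interest and print those for which has_non_ascii succeeds. Unlike the range ' -~' in Solution 1, this check does not flag ASCII control characters such as TAB.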
Solution 3
You can use tr to delete any foreign character from a filename and compare the result with the original filename to see if it contained foreign characters.
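# Save the list of regular files (assumes no newlines in file names).
find . -type f > filenames
# IFS= and -r keep surrounding whitespace and backslashes intact.
while IFS= read -r filename; do
    stripped="$(printf '%s\n' "$filename" | tr -d -C '[[:alnum:]][[:space:]][[:punct:]]')"
    test "$filename" = "$stripped" || printf '%s\n' "$filename"
done < filenames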
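A variant of the loop that avoids the intermediate file, and with it the problem of newlines in file names, might look like the sketch below. The function name list_names_with_other_bytes is hypothetical. It relies only on find's -exec ... {} + form, and it forces the C locale for tr so that the character classes cover ASCII only.

```shell
#!/bin/sh
# Hypothetical helper: print each regular file under $1 whose path contains
# a byte that is not ASCII alphanumeric, whitespace or punctuation.
list_names_with_other_bytes() {
    find "$1" -type f -exec sh -c '
        for name do
            # Keep only allowed bytes; the appended "x" survives (it is
            # alphanumeric) and protects trailing newlines in the name.
            kept=$(printf "%sx" "$name" |
                LC_ALL=C tr -d -c "[:alnum:][:space:][:punct:]")
            [ "${kept%x}" = "$name" ] || printf "%s\n" "$name"
        done
    ' sh {} +
}
```

Each path is handed to the inner shell as a separate argument, so no parsing of find output is involved. As in the original loop, TAB and newline count as whitespace and are therefore not reported.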
Solution 4
The accepted answer is helpful, but if your filenames are already in the encoding specified in LANG/LC_CTYPE, it's better to just do:
LC_COLLATE=C find . -name '*[! -~]*'
Character classes are affected by LC_CTYPE, but the above command does not use character classes, only ranges, so leaving LC_CTYPE at its current value just prevents the unusual characters from being replaced by question marks.
suspectus
Updated on September 18, 2022
Comments
-
suspectus almost 2 years
In a directory size 80GB with approximately 700,000 files, there are some file names with non-English characters in the file name. Other than trawling through the file list laboriously is there:
- An easy way to list or otherwise identify these file names?
- A way to generate printable non-English language characters - those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?
-
Timo over 10 years
That is a nice extension to my answer, but it is too simple: file names can have newlines in them, and then your script will not work.
-
Lekensteyn over 10 years
If you want to post-process find output, use NUL-terminated output/input as shown in this answer.
-
Stéphane Chazelas over 10 years
That also removes [ and ] in most tr implementations.
-
Lekensteyn over 10 years
Be aware that you may have file names that use foreign character sets that are incompatible with UTF-8 or ASCII. In those cases, you may see question marks instead of characters.
-
Stéphane Chazelas over 10 years
+1, but I would use LC_ALL=C instead of LC_COLLATE=C, as it doesn't make much sense to set LC_COLLATE to C without setting LC_CTYPE, and to make sure it still works even when the LC_ALL variable is in the environment.
-
Stéphane Chazelas over 10 years
If SPC is printable, then what about TAB and LF, which are also typically found in text files?
-
suspectus over 10 years
Yes - it did remove [ and ] on my system.
-
suspectus over 10 years
+1 - the solution did find all the (six) file names with non-ASCII symbols (in addition to the [ and ]s). Thanks.
-
suspectus over 10 years
Thanks - this found six files, which had long hyphen, short hyphen and a variant of single quote. These had all originated from MS Word. No difference in the files listed between LC_ALL and LC_COLLATE. LC_COLLATE displayed the non-ASCII chars correctly whereas LC_ALL displayed ??? instead. Excellent answer!
-
Lekensteyn over 10 years
@suspectus I updated my answer based on suggestions from Stephane. For LC_COLLATE and LC_CTYPE, see also the find(1) manpage.