Find Non-UTF8 Filenames on Linux File System
Solution 1
convmv might be interesting to you. It doesn't just find those files; it can also rename them to the correct file names (if it can guess what went wrong).
Solution 2
find . | perl -ne 'print if /[^[:ascii:]]/'
Solution 3
find . | egrep [^a-zA-Z0-9_\.\/\-\s]
Danger, shell escaping!
Bash interprets that last parameter itself and removes one level of backslash-escaping before egrep ever sees it. Try putting double quotes around the "[^group]" expression.
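To see what the shell does to that argument, here is a small sketch using Python's shlex module, which approximates POSIX shell word splitting (it does not model globbing or every bash detail). The helper name pattern_seen_by_egrep is made up for this illustration.

```python
import shlex

# The command as typed in the question, with the bracket expression unquoted.
unquoted = r"egrep [^a-zA-Z0-9_\.\/\-\s]"
# The same command with the expression wrapped in single quotes.
quoted = r"egrep '[^a-zA-Z0-9_\.\/\-\s]'"


def pattern_seen_by_egrep(command):
    """Hypothetical helper: return the last argument after POSIX-style
    quote and backslash removal, i.e. roughly what egrep would receive."""
    return shlex.split(command)[-1]


if __name__ == "__main__":
    print(pattern_seen_by_egrep(unquoted))
    print(pattern_seen_by_egrep(quoted))
```

In the unquoted form the backslashes are gone by the time egrep runs, so "\s" has turned into a plain "s"; inside single quotes the expression arrives unchanged.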
Also, of course, this rejects far more than invalid UTF-8: any character outside the listed ASCII set is flagged, even in a validly encoded name. It is possible to construct a regex that matches valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that:
import os.path

def walk(dir):
    # Recursively yield every path below dir (byte strings on Python 2).
    for child in os.listdir(dir):
        child = os.path.join(dir, child)
        if os.path.isdir(child):
            for descendant in walk(child):
                yield descendant
        yield child

for path in walk('.'):
    try:
        u = unicode(path, 'utf-8')
    except UnicodeError:
        # Not valid UTF-8: print the path, or attempt to rename the file here.
        print path
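If only Python 3 is available, the same idea might look roughly like the sketch below. It assumes a POSIX system where filenames are raw bytes; the function names is_valid_utf8 and find_non_utf8 are invented for this example and are not part of the answer above.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Hypothetical helper: True if the raw filename bytes decode as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root: bytes = b"."):
    """Hypothetical helper: yield byte paths under root whose own name
    is not valid UTF-8."""
    # Passing bytes to os.walk makes it return raw byte names instead of str.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in find_non_utf8():
        # repr() keeps the undecodable bytes visible as \x.. escapes.
        print(repr(path))
```

Walking with a bytes root matters here: with a str root, Python 3 would hide the bad bytes behind surrogate escapes instead of raising an error.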
Admin
Updated on July 28, 2022
Comments
-
Admin almost 2 years
I have a number of files hiding in my LANG=en_US.UTF-8 filesystem that have been uploaded with unrecognisable characters in their filenames.
I need to search the filesystem and return all filenames that have at least one character that is not in the standard range (a-zA-Z0-9 and .-_ etc.)
I have been trying the following, but with no luck.
find . | egrep [^a-zA-Z0-9_\.\/\-\s]
I'm using Fedora Core 9.
-
Arafangion almost 14 years
Single quotes would be better, in that context.
-
sl0815 over 7 years
I had 1000+ files with Windows 1252 encoding and lots of umlauts. "convmv -r -f cp1252 -t utf8 --notest ." worked for me. I didn't know there was such a program. Thanks!
-
Emiter over 4 years
If something is not ASCII, it doesn't mean it is not UTF-8.
-
Emiter about 4 years
Example:
emil@galeon:/tmp/expermients$ ls
laka.txt łąka.txt
emil@galeon:/tmp/expermients$ find . | perl -ane '{ if(m/[[:^ascii:]]/) { print } }'
./łąka.txt
And "łąka.txt" is a properly UTF-8-encoded name.
-
Chris L. Barnes over 3 years
If this disallows UTF-8, isn't it completely useless for the OP's request? They're trying to disallow non-UTF-8 filenames.
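The distinction raised in these comments can be shown with a short sketch: a non-ASCII check and a UTF-8 validity check give different answers for the same name. The sample names and the helpers is_ascii and is_utf8 are illustrative assumptions; the third sample is "łąka.txt" as it would look in the cp1250 encoding.

```python
def is_ascii(raw: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in raw)


def is_utf8(raw: bytes) -> bool:
    """True if the bytes form a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative names: plain ASCII, UTF-8, and the same Polish name in cp1250.
samples = [
    b"laka.txt",
    "łąka.txt".encode("utf-8"),
    b"\xb3\xb9ka.txt",
]

if __name__ == "__main__":
    for raw in samples:
        print(repr(raw), "ascii:", is_ascii(raw), "utf8:", is_utf8(raw))
```

Only the third name fails the UTF-8 check, while an ASCII-only filter would flag the second one as well.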