Find Non-UTF8 Filenames on Linux File System

11,306

Solution 1

convmv might be interesting to you. It doesn't just find those files, but also supports renaming them to correct file names (if it can guess what went wrong).

Solution 2

find . | perl -ne 'print if /[^[:ascii:]]/'
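
Note that "non-ASCII" and "not valid UTF-8" are different tests, so the one-liner above will also flag correctly encoded names. Here is a minimal Python 3 sketch of the difference, assuming the Perl filter works byte by byte. The helper names `has_non_ascii` and `is_valid_utf8` and the sample names are made up for illustration.

```python
def has_non_ascii(name: bytes) -> bool:
    """True if any byte lies outside 7-bit ASCII (roughly what the Perl filter flags)."""
    return any(byte > 0x7F for byte in name)


def is_valid_utf8(name: bytes) -> bool:
    """True if the raw bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample names: plain ASCII, valid UTF-8, and Latin-1 bytes.
samples = [
    b"laka.txt",
    "łąka.txt".encode("utf-8"),
    b"caf\xe9.txt",
]

for name in samples:
    print(name, "non-ascii:", has_non_ascii(name), "valid utf-8:", is_valid_utf8(name))
```

Only the last sample is both non-ASCII and invalid UTF-8, which is the case the question is really after.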

Solution 3

find . | egrep [^a-zA-Z0-9_./-\s]

Danger, shell escaping!

bash will be interpreting that last parameter, removing one level of backslash-escaping. Try putting double quotes around the "[^group]" expression.
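
To see what the shell does to the unquoted expression, here is a rough illustration using Python's `shlex` module, which approximates POSIX shell word splitting and quote removal. It does not model globbing, which bash may additionally apply to an unquoted bracket expression. The helper name `pattern_seen_by_egrep` is made up for this sketch.

```python
import shlex

# The command from the question, with the bracket expression unquoted.
unquoted = r"egrep [^a-zA-Z0-9_\.\/\-\s]"
# The same command with the expression protected by single quotes.
quoted = r"egrep '[^a-zA-Z0-9_\.\/\-\s]'"


def pattern_seen_by_egrep(command_line: str) -> str:
    """Return the last argument after POSIX-style splitting and quote removal."""
    return shlex.split(command_line)[-1]


print(pattern_seen_by_egrep(unquoted))  # backslashes have been stripped
print(pattern_seen_by_egrep(quoted))    # pattern arrives unchanged
```

In the unquoted case the `\s` collapses to a plain `s`, so the expression ends up containing the range `/-s` rather than a whitespace class.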

Also, of course, this rejects a lot more than just invalid UTF-8: any correctly encoded non-ASCII name is flagged too. It is possible to construct a regex to match valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that:

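import os.path

def walk(dir):
    # Recursively yield every path below dir (contents before the directory itself)
    for child in os.listdir(dir):
        child = os.path.join(dir, child)
        if os.path.isdir(child):
            for descendant in walk(child):
                yield descendant
        yield child

for path in walk('.'):
    try:
        # In Python 2, paths are byte strings; decoding fails on invalid UTF-8
        u = unicode(path, 'utf-8')
    except UnicodeError:
        # Not valid UTF-8: print the path, or attempt to rename the file here
        print repr(path)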
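
If only Python 3 is available, the same idea might look roughly like the sketch below. It assumes a system where file names are arbitrary byte strings (as on Linux) and passes a bytes root to `os.walk` so that names come back undecoded. The function name `find_non_utf8` is made up for illustration.

```python
import os


def find_non_utf8(root: bytes = b"."):
    """Yield paths (as bytes) under root whose final component is not valid UTF-8."""
    # A bytes root makes os.walk return raw bytes names instead of decoded str.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                # Report the offending entry; renaming could be attempted here instead.
                yield os.path.join(dirpath, name)
```

Calling `list(find_non_utf8(b"."))` from the directory in question would then collect the offending paths as bytes, which can be printed with `repr` without risking another encoding error.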
Author by

Admin

Updated on July 28, 2022

Comments

  • Admin
    Admin almost 2 years

    I have a number of files hiding in my LANG=en_US:UTF-8 filesystem that have been uploaded with unrecognisable characters in their filename.

    I need to search the filesystem and return all filenames that have at least one character that is not in the standard range (a-zA-Z0-9 and .-_ etc.)

    I have been trying the following, but with no luck.

    find . | egrep [^a-zA-Z0-9_\.\/\-\s]
    

    I'm using Fedora Core 9.

  • Arafangion
    Arafangion almost 14 years
    Single quotes would be better, in that context.
  • sl0815
    sl0815 over 7 years
    I had 1000+ files with Windows 1252 encoding and lots of umlauts. "convmv -r -f cp1252 -t utf8 --notest ." worked for me. Didn't know there was such a program. Thanks!
  • Emiter
    Emiter over 4 years
    If something is not ASCII, that doesn't mean it is not UTF-8.
  • Emiter
    Emiter about 4 years
    Example:

    emil@galeon:/tmp/expermients$ ls
    laka.txt  łąka.txt
    emil@galeon:/tmp/expermients$ find . | perl -ane '{ if(m/[[:^ascii:]]/) { print } }'
    ./łąka.txt

    And "łąka.txt" is a properly UTF-8 encoded name.
  • Chris L. Barnes
    Chris L. Barnes over 3 years
    If this disallows UTF-8, isn't it completely useless for OP's request? They're trying to disallow non UTF-8 filenames.