Find files by character encoding
Solution 1
Using isutf8 from the moreutils package:
find . -name '*.py' -exec isutf8 {} +
Or:
find . -name '*.py' | xargs isutf8
(The latter assumes that the file names contain no whitespace, quotes, or backslashes, since xargs treats those specially by default.)
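If isutf8 is not installed, the same check can be sketched in Python. This is a minimal sketch, not part of the original answer: the function name find_non_utf8 and its pattern parameter are made up for illustration, and it reads each file fully into memory, which assumes the files are reasonably small.

```python
from pathlib import Path


def find_non_utf8(root, pattern="*.py"):
    """Return paths under root matching pattern whose bytes are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: list every undecodable .py file below the current directory.
    for path in find_non_utf8("."):
        print(path)
```

Pure ASCII files pass this check, because every ASCII file is also valid UTF-8.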
Solution 2
To build a similarly failing file, we can use this script:
{ printf '%*s' "179"; printf '\x81'; printf '%*s' "20"; } >infile
Then this command will print at which position the file fails:
$ isutf8 infile
infile: line 1, char 1, byte offset 180: invalid UTF-8 code
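For comparison, here is a small Python sketch that builds the same bytes as the printf command above and asks the decoder where it fails. The helper name first_invalid_offset is hypothetical. Python counts positions from zero, so it reports 179 for the byte that the isutf8 output above lists at offset 180; keep that difference in mind when matching a Python error message against isutf8 output.

```python
def first_invalid_offset(data):
    """Return the zero-based offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Same bytes as the printf command above: 179 spaces, 0x81, 20 spaces.
sample = b" " * 179 + b"\x81" + b" " * 20
print(first_invalid_offset(sample))  # 179
```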
So, this will test all Python (.py) files in the current directory for an invalid code at position 180:
$ isutf8 ./*.py | grep "offset 180"
Or, even more flexibly, a range of offsets (GNU extended regex; this pattern matches any offset that starts with 17 or 18, such as 170 to 189):
$ isutf8 ./*.py | grep -E "offset (17|18)"
Or a specific test for files in the whole directory tree:
$ find . -iname "*.py" -type f -exec bash -c 'isutf8 "$1" | grep -E "offset (17|18)"' Find {} \;
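The offset filter can also be expressed with a numeric range instead of a regex. The following is only a sketch under assumptions: files_failing_near and its lo, hi, and suffix parameters are invented names, offsets are zero-based as Python reports them, and only the first invalid byte of each file is considered.

```python
import os


def files_failing_near(root, lo, hi, suffix=".py"):
    """Yield (path, offset) for files whose first invalid UTF-8 byte lies in [lo, hi]."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):  # case-insensitive, like -iname
                continue
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken links and other non-regular entries
            with open(path, "rb") as handle:
                data = handle.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as err:
                if lo <= err.start <= hi:  # err.start is zero-based
                    yield path, err.start
```

A call such as files_failing_near(".", 170, 189) would cover roughly the same offsets as the grep pattern above, without also matching offsets like 17 or 1800.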
Filip Haglund
Updated on September 18, 2022
Comments
-
Filip Haglund over 1 year
I have a long-running Python script that failed to UTF-8 decode a file. The error message doesn't tell me which file it failed on, only that it couldn't decode byte 0x81 in position 194. I know which folder the file is in, but not where among the thousands of files in that subtree. What are my options for finding this file (and others like it)? Is there a pretty one-liner in bash for this?
Changing the script to print what it's looking at and rerunning it, fixing one file at a time, is barely an option, as running the script once takes many hours. Writing a directory traverser in Python seems like a bit too much work.
-
terdon over 7 years
Is it enough to just print all file names that aren't UTF-8? Does for f in *; do file "$f" | grep -q UTF-8 || echo "$f"; done do what you want?
-
Filip Haglund over 7 years
The file command seems not to recognize a lot of Python source code as UTF-8 (or even as Python, for that matter). What I need is to find files that cannot be parsed as UTF-8, which includes ASCII.
-
terdon over 7 years
Why would the Python sources be UTF-8? And it should recognize them as Python if they have a valid shebang line. In any case, that command will return all files that aren't UTF-8 encoded, so yes, including ASCII. Isn't that what you want?
-
Filip Haglund over 7 years
I need a list of all (Python) files that cannot be decoded as UTF-8, so ASCII is fine. Your script finds directories and other file types as well, including many Python files that are missing a shebang. Also, I need to recurse into subdirectories. **/*.py should do the trick, but doesn't solve the other half of the problem. I suspect there are some Python files with CP-1252 or Latin1 encoding somewhere in there.
-
n.st over 7 years
You can probably run iconv on all files and check where it fails.
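The same trial-decoding idea can be sketched in Python without iconv. This is an illustration only: first_matching_encoding and its candidate list are assumptions chosen to match the CP-1252 and Latin1 suspicion in the comment above, not something from the thread.

```python
def first_matching_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes data without error, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(first_matching_encoding(b"caf\xe9"))  # cp1252
print(first_matching_encoding(b"\x81"))     # latin-1
```

Because latin-1 maps every byte to a character, it always succeeds, so a latin-1 result only means the earlier candidates failed. Python's cp1252 codec leaves byte 0x81 unassigned, which is why the second example falls through to latin-1.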
-