Find files by character encoding


Solution 1

Using isutf8 from the moreutils package:

find . -name '*.py' -exec isutf8 {} +

Or:

find . -name '*.py' | xargs isutf8

(The latter assumes that the file names contain no newlines.)
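With GNU find and xargs, NUL-terminating the file names removes that caveat; a sketch:

```shell
# NUL-delimit names so paths containing newlines (or other odd
# characters) reach isutf8 intact
find . -name '*.py' -print0 | xargs -0 isutf8
```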

Solution 2

To build a similarly failing file, we can use this script:

{ printf '%*s' "179"; printf '\x81'; printf '%*s' "20"; } >infile
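A quick check that the file came out as intended (octal \0201 is the same byte as hex \x81; wc and od are only used here for inspection):

```shell
# Rebuild the test file portably and inspect it
{ printf '%179s' ''; printf '%b' '\0201'; printf '%20s' ''; } > infile
wc -c < infile                # total size: 200 bytes
od -An -tx1 -j179 -N1 infile  # the byte at 0-based offset 179 is 81
```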

Then this command will print at which position the file fails:

$ isutf8 infile 
infile: line 1, char 1, byte offset 180: invalid UTF-8 code

So, this will test all Python (.py) files in the current directory for an invalid code at byte offset 180:

$ isutf8 ./*.py | grep "offset 180"

Or, more flexibly, a range of offsets (GNU extended regex; this matches any offset that begins with 17 or 18):

$ isutf8 ./*.py | grep -E "offset (17|18)"

Or, a specific test for files anywhere in the directory tree:

$ find . -iname "*.py" -type f -exec bash -c 'isutf8 "$1" | grep -E "offset (17|18)"' Find {} \;
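To print only the names of the offending files rather than the grep output, isutf8's quiet flag can be combined with find (assuming the moreutils isutf8, whose -q/--quiet option suppresses diagnostics and leaves just the exit status):

```shell
# List .py files that are not valid UTF-8, one path per line
find . -iname '*.py' -type f -exec bash -c '
  for f; do
    isutf8 -q "$f" || printf "%s\n" "$f"   # non-zero exit = invalid UTF-8
  done
' bash {} +
```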



Updated on September 18, 2022

Comments

  • Filip Haglund
    Filip Haglund over 1 year

    I have a long-running python script that failed to utf-8 decode a file. The error message doesn't tell me what file it failed on, only that it couldn't decode byte 0x81 in position 194. I know which folder the file is in, but not where among the thousands of files somewhere in that subtree. What are my options for finding this file (and others like it)? Is there a pretty one-liner in bash for this?

    Changing the script to print what it's looking at and rerunning it, fixing one file at a time, is barely an option, as running the script once takes many hours. Writing a directory traverser in Python seems like a bit too much work.

    • terdon
      terdon over 7 years
      Is it enough to just print all file names that aren't UTF-8? Does for f in *; do file "$f" | grep -q UTF-8 || echo $f; done do what you want?
    • Filip Haglund
      Filip Haglund over 7 years
      File seems to not recognize a lot of python source code as utf-8 (or even python for that matter). What I would need is to find files that cannot be parsed as utf-8, which includes ascii.
    • terdon
      terdon over 7 years
      Why would the python sources be utf8? And it should recognize them as python if they have a valid shebang line. In any case, that command will return all files that aren't utf8 encoded, so yes, including ascii. Isn't that what you want?
    • Filip Haglund
      Filip Haglund over 7 years
      I need a list of all (python) files that cannot be decoded as utf-8, so ascii is fine. Your script finds directories and other file types as well, including many python files that are missing a shebang. Also, I need to recurse into subdirectories. **/*.py should do the trick, but doesn't solve the other half of the problem. I suspect there's some python files with CP-1252 or Latin1 encoding somewhere in there.
    • n.st
      n.st over 7 years
      You can probably run iconv on all files and check where it fails.