How to detect invalid UTF-8 (Unicode/binary) characters in a text file


Solution 1

Assuming you have your locale set to UTF-8 (check the output of locale), this works well to recognize invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from grep man page):

  • -a, --text: processes the file as text; this is essential, since it prevents grep from aborting as soon as it finds an invalid byte sequence (i.e., one that is not UTF-8)
  • -v, --invert-match: inverts the output, showing the lines that are not matched
  • -x '.*' (--line-regexp): matches only complete lines consisting entirely of valid UTF-8 characters

Hence, any output consists of exactly those lines that contain an invalid (non-UTF-8) byte sequence (because the match is inverted by -v).
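
A minimal usage sketch, assuming a file named file.txt and that a UTF-8 locale such as en_US.UTF-8 is installed (grep takes its idea of a "character" from the locale category LC_CTYPE, which LC_ALL overrides):

LC_ALL=en_US.UTF-8 grep -naxv '.*' file.txt   # -n prefixes each offending line with its line number

if LC_ALL=en_US.UTF-8 grep -qaxv '.*' file.txt; then
    echo "file.txt contains invalid UTF-8"
fi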

Solution 2

I would grep for non-ASCII characters.

With GNU grep with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:

grep -P "[\x80-\xFF]" file

Reference: How Do I grep For all non-ASCII Characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:

if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
#        ^
#        silent grep

To remove these characters, you can use:

sed -i.bak 's/[\d128-\d255]//g' file

This will create file.bak as a backup, while the original file has its non-ASCII characters removed. Reference: Remove non-ascii characters from csv. Be aware that the \dNNN notation inside brackets does not work in many versions of sed (see the comments below), and that non-ASCII is not the same as invalid UTF-8: this also strips perfectly valid multi-byte UTF-8 sequences.
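
If your sed rejects that notation, here is a sketch of two portable alternatives (both mentioned in the comments below) that delete the same byte range; file is the hypothetical input name:

LC_ALL=C tr -d '\200-\377' < file > file.clean   # octal 200-377 = bytes 128-255
perl -i.bak -pe 's/[\x80-\xFF]//g' file          # edits in place, keeps file.bak as backup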

Solution 3

Try this to find non-ASCII characters from the shell.

Command:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt

Output:

2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
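
Note that this pattern flags any non-ASCII byte, including valid multi-byte UTF-8. If you want to report only lines that fail strict UTF-8 decoding, here is a sketch using Perl's core Encode module (decode with FB_CROAK dies on malformed input, so each line is tested inside an eval; the file name is a placeholder):

perl -MEncode -ne 'my $c = $_; eval { decode("UTF-8", $c, Encode::FB_CROAK); 1 } or print "$. $_"' file.txt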

Solution 4

What you are looking at is by definition corrupted. Apparently, you are displaying the file as it renders in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).

So the answer to your question about how to detect this particular phenomenon is easy: the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom of the process you are implying.
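
For example, since U+FFFD is encoded in UTF-8 as the bytes 0xEF 0xBF 0xBD, a sketch using bash's ANSI-C quoting to search for it directly:

grep -n $'\xef\xbf\xbd' file.txt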

These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence which encodes a valid Unicode code point; it's just that the semantics of this particular code point is "this is a replacement character for a character which could not be represented properly", i.e. invalid input.

As for how to prevent it in the first place, the answer is really simple, but also rather uninformative -- you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.

To just remove the U+FFFD characters, try something like

perl -CSD -pe 's/\x{FFFD}//g' file

but again, the proper solution is to not generate these erroneous outputs in the first place.

(You are not revealing the encoding of your example data. It is possible that it has an additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took -- already corrupted, as per the above -- UTF-8 text and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data before the superfluous incorrect conversion.)
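
A sketch of that "back" conversion with iconv (the file names are placeholders; the output should be the original UTF-8 data from before the superfluous conversion):

iconv -f UTF-8 -t ISO-8859-1 double-encoded.txt > recovered.txt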

Solution 5

This Perl program should remove all non-ASCII characters:

foreach my $file (@ARGV) {
  open(my $in,  '<', $file) or die "cannot read $file: $!";
  open(my $out, '>', 'super-temporary-utf8-replacement-file-which-should-never-be-used-EVER')
    or die "cannot write temp file: $!";
  while (<$in>) {
    s/[^[:ascii:]]//g;    # delete every non-ASCII byte
    print $out $_;
  }
  close $in;
  close $out;             # flush before replacing the original
  rename 'super-temporary-utf8-replacement-file-which-should-never-be-used-EVER', $file
    or die "cannot rename temp file onto $file: $!";
}

This takes files as input on the command line, like so:
perl fixutf8.pl foo bar baz
Then, for each line, it replaces each instance of a non-ASCII character with nothing (deletion), and writes the modified line out to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it doesn't clobber any other file).
Afterwards, it renames the temporary file over the original one.

This accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in s///.
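
If you don't need a reusable script, the same transformation can be sketched as a one-liner using perl's in-place editing (-i.bak keeps a backup of each file; file names are the same hypothetical ones as above):

perl -i.bak -pe 's/[^[:ascii:]]//g' foo bar baz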

I hope this helps! Please let me know if this wasn't what you were looking for.


Comments

  • user121196 over 4 years

    I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.

    �>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
    

    What I have tried:

    iconv -f utf-8 -t utf-8 -c file.csv 
    

    This converts the file from UTF-8 encoding to UTF-8 encoding, with -c skipping invalid UTF-8 characters. However, those illegal characters still get printed at the end. Are there any other solutions in bash on Linux, or in other languages?

  • Mike S about 9 years
    The sed command has a couple of issues: For many versions of sed (e.g. CentOS 7, sed 4.2.2), putting the notation \dNNN does not work inside the brackets. Usable one-liners are the perl example from stackoverflow.com/questions/3337936/… or the tr from stackoverflow.com/questions/15034944/… . Also, the search and replace is missing the "g" at the end; as expressed (and if it worked) it would only get the first non-ascii character per "line" in the file.
  • humanityANDpeace about 7 years
    This code helped me. A short description of the options used would have helped me more directly. The options used here: -a treats the file as text, which is essential as it prevents grep from aborting once it finds an invalid byte sequence (not being UTF-8); -v inverts the output, showing lines not matched; finally, -x '.*' matches a complete line consisting of any UTF-8 characters. Hence the output is the lines containing invalid, non-UTF-8 byte sequences (because of the inverted -v).
  • Russell Silva about 7 years
    How do I tell if my locale is set to UTF-8?
  • Blaf about 7 years
    Check the output of locale.
  • Janus Troelsen almost 7 years
    -P doesn't work on FreeBSD. Users beware: they may need GNU grep.
  • Janus Troelsen almost 7 years
    Actually, FreeBSD uses GNU grep too, just not with PCRE, so I clarified the edit.
  • tripleee over 6 years
    At the risk of restating the already obvious, there are millions of valid UTF-8 sequences which are not ASCII. Removing those is clearly the wrong solution.
  • tripleee over 6 years
    This strips out any valid UTF-8 sequences. It will abort on invalid UTF-8 input.
  • sjas about 6 years
    99% of the time you want exactly this solution. Thank you, sir.
  • Badashi about 4 years
    For future reference, if someone needs to use this to search for corrupted files in a folder (for example, .java files): ugrep -e "." -N "\p{Unicode}" **/*.java should print a list of files that match the expression (and are, therefore, corrupted).
  • GTodorov about 4 years
    Thanks! That saved me a bit of a headache. +1 for the "-a" parameter!
  • Bernhard Döbler over 3 years
    I just used ugrep.exe as provided for Windows users on their GitHub page. To use a glob to identify files, I had to use the -g switch.
  • Jill-Jênn Vie over 3 years
    It seems really exciting but I get $: command not found!
  • Chris Johnson over 3 years
    Non-ASCII isn't the same as non-UTF-8; the OP asked about UTF-8.
  • Chris L. Barnes over 3 years
    Which part of the locale output is used by grep to determine encoding? Could you suggest a command like CORRECT_VAR_NAME=en_US.UTF-8 grep -axv '.*' file.txt?