Grep is not matching non-ascii characters

grep unicode hex


The bytes e4 75 are indeed an illegal UTF-8 sequence. In UTF-8, a byte whose high nibble is 0xe introduces a three-byte sequence. The second byte of such a sequence cannot be 0x75, because continuation bytes must lie in the range 0x80 to 0xbf, and the high nibble of 0x75 (0x7) is not between 0x8 and 0xb.

This explains why iconv rejects that file as invalid UTF-8. Perhaps it's already ISO 8859-1? In that encoding 0xe4 is ä, which would make the word "Oberallgäu".
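One way to test that guess is to convert in the opposite direction, treating the input as ISO 8859-1. The following is only a sketch: it assumes an iconv that accepts the encoding names `ISO-8859-1` and `UTF-8`, and `latin1_to_utf8` is a hypothetical helper name, not something from the question.

```shell
# Hypothetical helper: reinterpret stdin as ISO 8859-1 and emit UTF-8.
latin1_to_utf8() { iconv -f ISO-8859-1 -t UTF-8; }

# Recreate the bytes from the question; \344 is the octal form of 0xe4.
# od -An -tx1 then dumps the converted bytes one at a time.
printf 'Oberallg\344u \n' | latin1_to_utf8 | od -An -tx1
```

If the guess is right, the single byte e4 should come out as the two-byte sequence c3 a4, which is the UTF-8 encoding of ä. For the file in the question, the equivalent call would presumably be `iconv -f ISO-8859-1 -t UTF-8 corrupt_part.txt`.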

For a summary of the UTF-8 encoding, consult the encoding table in the Wikipedia article on UTF-8.
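The lead-byte and continuation-byte rules from that table can be sketched in plain shell arithmetic. The function names `utf8_seq_len` and `is_continuation` are made up for this illustration, and the check is deliberately simplified: it looks only at the bit patterns and does not reject overlong or out-of-range lead bytes such as c0, c1, or f5 to f7.

```shell
# utf8_seq_len HEX: print the sequence length announced by a lead byte,
# or 0 if the byte cannot start a sequence (bit patterns only).
utf8_seq_len() {
  if   [ $(( 0x$1 & 0x80 )) -eq 0 ];   then echo 1   # 0xxxxxxx: ASCII
  elif [ $(( 0x$1 & 0xE0 )) -eq 192 ]; then echo 2   # 110xxxxx
  elif [ $(( 0x$1 & 0xF0 )) -eq 224 ]; then echo 3   # 1110xxxx
  elif [ $(( 0x$1 & 0xF8 )) -eq 240 ]; then echo 4   # 11110xxx
  else echo 0                                        # 10xxxxxx or 11111xxx
  fi
}

# is_continuation HEX: succeed if the byte has the form 10xxxxxx (0x80-0xbf).
is_continuation() {
  [ $(( 0x$1 & 0xC0 )) -eq 128 ]
}

utf8_seq_len e4                                      # prints 3
is_continuation 75 || echo "75 is not a continuation byte"
```

Applied to the bytes in the question, e4 announces a three-byte sequence, but 75 fails the continuation test, which matches the position iconv complains about.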

As for your grep issue, try specifying the C/POSIX locale, where characters are equivalent to bytes:

LC_ALL=C grep -P -n '[^\x00-\x7F]' corrupt_part.txt

For comparison, on an old Ubuntu system with GNU grep and an environment using the en_US.UTF-8 locale, the line is matched both with and without LC_ALL=C:

$ od -h bytes
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ LC_ALL=C grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
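A note on reading these dumps: `od -h` prints 16-bit words in the host byte order, so on a little-endian machine the word `75e4` stands for the bytes e4 75 in file order. A byte-wise dump avoids the mental swap. The sketch below recreates the 12 bytes with a hypothetical `sample_bytes` helper and uses the POSIX options `-An` and `-tx1`.

```shell
# Hypothetical helper: emit the same 12 bytes as the file in the question.
# \344 is the octal escape for 0xe4 (printf's \x escapes are not portable).
sample_bytes() { printf 'Oberallg\344u \n'; }

# -An suppresses the offset column; -tx1 prints one byte at a time,
# in the order the bytes appear in the input.
sample_bytes | od -An -tx1
```

In this form the offending pair shows up directly as `e4 75`, in the same order as in the file.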


bioslime

Updated on September 18, 2022

Comments

bioslime over 1 year
I found a problematic byte sequence in a supposedly UTF-8 encoded text file. The strange thing is that grep seems unable to match this non-ASCII line.
```
$ iconv -f utf8 -t iso88591 corrupt_part.txt --output corrupt_part.txt.conv
iconv: illegal input sequence at position 8
$ cat corrupt_part.txt
Oberallg�u
$ grep -P -n '[^\x00-\x7F]' corrupt_part.txt
$ od -h corrupt_part.txt
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
```
So \xe4 is, for example, ä in extended ASCII character sets such as ISO 8859-1. Since the pattern excludes exactly the ASCII range (control and printable characters), the grep command above should match the \xe4 byte. Why am I not getting any grep output?
bioslime about 8 years

LC_ALL=C indeed yields the desired outcome! It had not occurred to me that the locale is critical even when grepping for hex values (bytewise vs. multibyte matching).