Grep is not matching non-ascii characters
The byte pair e4 75 is indeed an illegal UTF-8 sequence. In UTF-8, a byte whose high nibble is 0xe introduces a three-byte sequence. The second byte of such a sequence cannot be 0x75, because its high nibble (0x7) is not between 0x8 and 0xb.
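The nibble rule can be checked with plain shell arithmetic. This is only a minimal sketch: the function name check_three_byte_start is made up for illustration, and it tests just the rule described above, not full UTF-8 validity (overlong forms, surrogates, and the third byte are ignored).

```shell
# Hypothetical helper: checks only the nibble rule described above.
# $1 = lead byte, $2 = second byte, in any form shell arithmetic accepts (e.g. 0xe4).
check_three_byte_start() {
    lead=$(( $1 ))
    second=$(( $2 ))
    # A high nibble of 0xe marks the start of a three-byte sequence.
    if [ $(( lead >> 4 )) -ne $(( 0xe )) ]; then
        echo "not a three-byte lead"
        return 0
    fi
    # A continuation byte has a high nibble between 0x8 and 0xb (bits 10xxxxxx).
    nib=$(( second >> 4 ))
    if [ "$nib" -ge 8 ] && [ "$nib" -le 11 ]; then
        echo "ok"
    else
        echo "bad continuation"
    fi
}

check_three_byte_start 0xe4 0x75   # the pair from the file
check_three_byte_start 0xe2 0x82   # first two bytes of the euro sign (e2 82 ac)
```

The first call reports a bad continuation byte, which is the same complaint iconv makes at position 8.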
This explains why iconv rejects that file as invalid utf8. Perhaps it's already iso8859-1?
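One way to test the ISO 8859-1 guess is to convert in the opposite direction. The sketch below recreates the 12 bytes from the od dump with printf (od -h prints 16-bit words, so on a little-endian machine the bytes e4 75 show up swapped as 75e4). The file names sample.txt and sample.utf8.txt are placeholders, not the original files.

```shell
# Recreate the bytes from the od dump: "Oberallg", 0xe4, "u", space, newline.
printf 'Oberallg\344u \n' > sample.txt

# Treat the input as ISO 8859-1 and convert it to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 sample.txt > sample.utf8.txt

# Dump both files one byte at a time; the single byte e4 should
# become the two-byte UTF-8 sequence c3 a4.
od -An -tx1 sample.txt
od -An -tx1 sample.utf8.txt
```

If the conversion succeeds and the result displays as "Oberallgäu" in a UTF-8 terminal, the file was most likely ISO 8859-1 (or a close relative such as Windows-1252) all along.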
For a summary of the UTF-8 encoding, see the byte-layout table in the Wikipedia article on UTF-8.
As for your grep issue, try specifying the C/POSIX locale, where characters are equivalent to bytes:
LC_ALL=C grep -P -n '[^\x00-\x7F]' corrupt_part.txt
On an old Ubuntu system with GNU grep, in an environment using the en_US.UTF-8 locale:
$ od -h bytes
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ LC_ALL=C grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
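To see exactly which bytes the pattern catches, the C-locale command can be combined with -o. This is a sketch assuming GNU grep built with PCRE support; sample.txt is a placeholder file recreated from the od dump, and -a is added only so that grep does not fall back to a "binary file matches" message.

```shell
# Recreate the bytes from the od dump: "Oberallg", 0xe4, "u", space, newline.
printf 'Oberallg\344u \n' > sample.txt

# Count the lines that contain at least one byte outside 0x00-0x7f.
LC_ALL=C grep -a -c -P '[^\x00-\x7F]' sample.txt

# Print only the offending bytes, one match per line, and dump them in hex.
LC_ALL=C grep -a -o -P '[^\x00-\x7F]' sample.txt | od -An -tx1
```

With LC_ALL=C every byte is its own character, so the byte e4 is matched directly instead of being treated as the start of a malformed multibyte sequence.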
bioslime
Updated on September 18, 2022
bioslime over 1 year
I found a problematic sequence of a supposedly UTF-8 encoded text file. The strange thing is that grep seems unable to match this non-ASCII line.
$ iconv -f utf8 -t iso88591 corrupt_part.txt --output corrupt_part.txt.conv
iconv: illegal input sequence at position 8
$ cat corrupt_part.txt
Oberallg�u
$ grep -P -n '[^\x00-\x7F]' corrupt_part.txt
$ od -h corrupt_part.txt
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
So \xe4 is, for example, ä in the extended ASCII set. Since the pattern excludes the control and printable characters (the ASCII range), the grep command above should match the \xe4 character. Why am I not getting any grep output?
bioslime about 8 years
LC_ALL=C indeed yields the desired outcome! It had not occurred to me that the locale matters even when grepping for hex values (bytewise vs. multibyte sequences).