Grep is not matching non-ascii characters

e4 75 is indeed an illegal utf8 sequence. In utf8, a byte with the highest nibble equal to 0xe introduces a three byte sequence. The second byte of such a sequence cannot be 0x75, because the high order nibble of that second byte (0x7) is not between 0x8 and 0xb.
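The od -h dump in the question prints 16-bit words in host byte order, so on a little-endian machine the word 75e4 stands for the bytes e4 75 in file order. A minimal sketch for viewing the bytes one at a time, assuming a hypothetical file sample.txt that recreates the twelve bytes from the question:

```shell
# Recreate the 12-byte sample from the question (\344 is octal for 0xe4).
# sample.txt is a hypothetical stand-in for corrupt_part.txt.
printf 'Oberallg\344u \n' > sample.txt

# -An drops the offset column; -tx1 prints one byte per column, in file order,
# so no mental byte-swapping is needed.
od -An -tx1 sample.txt
```

In this view e4 should appear directly before 75, which makes it easier to check the lead byte against the byte that follows it.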

This explains why iconv rejects that file as invalid utf8. Perhaps it's already iso8859-1?
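One way to test the ISO-8859-1 guess is to run the conversion in the other direction. The following is only a sketch: sample.txt and sample.utf8.txt are hypothetical names, and since iconv accepts any byte as ISO-8859-1, a successful conversion does not prove the guess; the converted text still has to be read and judged.

```shell
# Hypothetical stand-in for corrupt_part.txt (\344 is octal for 0xe4).
printf 'Oberallg\344u \n' > sample.txt

# Strict check: iconv exits non-zero if the input is not well-formed UTF-8.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi

# Treat the bytes as ISO-8859-1 and re-encode them as UTF-8.
iconv -f ISO-8859-1 -t UTF-8 sample.txt > sample.utf8.txt

# The single byte e4 should now be the two-byte sequence c3 a4.
od -An -tx1 sample.utf8.txt
```

If the converted file displays as the expected word (here, Oberallgäu), the original was most likely ISO-8859-1 or a close relative such as Windows-1252.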

For a summary of the UTF-8 encoding, see the byte-layout table in the Wikipedia article on UTF-8.

As for your grep issue, it may work if you specify the C/POSIX locale, where every byte is treated as one character:

LC_ALL=C grep -P -n '[^\x00-\x7F]' corrupt_part.txt

On an old Ubuntu system with GNU grep and the en_US.UTF-8 locale, both invocations match the line for me:

$ od -h bytes
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ LC_ALL=C grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014

Author: bioslime

Updated on September 18, 2022

Comments

  • bioslime
    bioslime over 1 year

    I found a problematic byte sequence in a supposedly UTF-8 encoded text file. The strange thing is that grep seems unable to match this non-ASCII line.

    $ iconv -f utf8 -t iso88591 corrupt_part.txt --output corrupt_part.txt.conv
    iconv: illegal input sequence at position 8
    $ cat corrupt_part.txt
    Oberallg�u
    $ grep -P -n '[^\x00-\x7F]' corrupt_part.txt
    $ od -h corrupt_part.txt
    0000000 624f 7265 6c61 676c 75e4 0a20
    0000014
    

    \xe4 is ä in extended ASCII encodings such as ISO-8859-1. Since the pattern excludes only the ASCII range \x00-\x7F (control and printable characters), the grep command above should match the \xe4 byte. Why am I not getting any grep output?

  • bioslime
    bioslime about 8 years
    LC_ALL=C indeed yields the desired outcome! It had not occurred to me that the locale matters even when grepping for hex values (bytewise vs. multibyte matching).