grep regex .* not matching everything

5,797

This is a typical character-encoding problem. . means any character, but which sequences of bytes form a legal character depends on the encoding. Dealing with text without knowing its encoding is a sure path to failure. Your grep command probably expects UTF-8 encoded input. UTF-8 is a multibyte encoding, meaning that some characters are represented by several bytes. However, not every sequence of bytes is valid. See, for example, the Wikipedia article on UTF-8.

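When grep encounters a byte sequence that is not a valid character in the expected encoding, it cannot treat it as a character, so . does not match it and the line as a whole does not match. Because you used -v, which prints non-matching lines, the line is output. Since your terminal doesn't recognise the byte sequence either, it displays a �.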

There is a workaround in your case: tell grep not to bother about the encoding and to treat each byte as one character.

env LANG=C grep ....

or, since LC_ALL overrides LANG when it is set:

env LANG=C LC_ALL=C grep ....
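
To make the effect of the locale concrete, here is a minimal sketch. The sample line, the helper name `sample` and the `C.UTF-8` locale name are assumptions made for illustration; the sample is "café" written with the single byte 0xE9 for é, which is not valid UTF-8 on its own.

```shell
# Hypothetical sample: "café" where é is the single byte 0xE9 (octal 351).
sample() { printf 'caf\351\n'; }

# In the C locale every byte counts as one character, so "." matches 0xE9.
count_c=$(sample | LC_ALL=C grep -a -c '^caf.$')
echo "C locale: $count_c matching line(s)"

# In a UTF-8 locale the same byte is an encoding error, which GNU grep's "."
# is not expected to match. C.UTF-8 is an assumption: substitute any UTF-8
# locale listed by `locale -a`. If the locale is missing, grep falls back
# to the C locale and the result will be the same as above.
count_utf8=$(sample | LC_ALL=C.UTF-8 grep -a -c '^caf.$' || true)
echo "UTF-8 locale: $count_utf8 matching line(s)"
```

The `|| true` is only there because grep exits with a non-zero status when nothing matches.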

You may test easily:

Create 2 files, one utf-8 encoded, one utf-16-be:

$ echo éléphant | tee file.std | iconv -f utf8 -t utf16be >file.utf16be

Check content of files:

$ cat file*
éléphant
�l�phant

Try to grep. The utf16be string is not recognised, no output:

$ grep '^.*$' file*
file.std:éléphant

Now ignore the encoding altogether, so that one byte is one character. All lines are matched; the � just means that the terminal doesn't recognise the UTF-16BE byte sequence as a valid UTF-8 character. Note the use of -a, which tells grep to treat binary data as if it were text.

$ env LANG=C grep -a '^.*$' file*
file.std:éléphant
file.utf16be:�l�phant

Alternatively, if you know the encoding, you can first convert the file with iconv and then use grep. One of the following should work.

iconv -f utf16   -t utf8 < file | grep ...
iconv -f utf16le -t utf8 < file | grep ...
iconv -f utf16be -t utf8 < file | grep ...
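
As a rough check that a conversion did what was intended, the bytes can be inspected with od before grepping. This is only a sketch: the sample line and the helper name `latin9_sample` are made up, and ISO-8859-15 is used as the source encoding because that is what the asker's file turned out to be.

```shell
# Hypothetical sample: "café" in ISO-8859-15, where é is the single byte 0xE9.
latin9_sample() { printf 'caf\351\n'; }

# Convert to UTF-8 and dump the bytes as hex; é should become c3 a9.
utf8_hex=$(latin9_sample | iconv -f ISO-8859-15 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"

# After conversion, grep can be run on the result as usual.
match_count=$(latin9_sample | iconv -f ISO-8859-15 -t UTF-8 | grep -c 'caf')
echo "$match_count"
```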

Author by

Liam Baker


Updated on September 18, 2022

Comments

  • Liam Baker
    Liam Baker over 1 year

    I've recently gotten into using tools like grep, wc, cat, etc. because I have to deal with some very large CSV files (>10GB) which aren't quite delimited correctly (for instance, having occurrences of the delimiter character inside some of the fields).

    While working with one of these files, I ran the following command in the process of trying to figure out a way to correctly identify which instances of ; are delimiters and replace them with some other character:

    grep -v -n --text "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" < Transactions.csv
    

    The regex can probably be done much better, but anyway; what is surprising is that, among others, the above code outputs the following line:

    12345678:2016-10-25;12345678912345;2016-10-25;gobbledegook �IDNR: 69 ;12345.67;.00;2003-09-05;12345678;2003-09-03;stuff stuff         ;12345 fgadfkjgbsdkb;12/3/45678/9
    

    (As this was actually transaction data, I've changed most of the fields' values, except for the offending �.) Maybe I'm being silly, but why doesn't the above regex match that line? It seems like the .* in the regex somehow doesn't match that character.

    I suspect that the file is saved using the UTF-16 encoding, if that makes any difference.

    Edit: Thanks to @exore for the answer. As it turns out, my file was encoded in ISO-8859-15, which I was able to figure out by grepping out the lines containing special characters, which were relatively few, into a file and opening that in gedit. I then used iconv to convert that to utf8, after which it worked fine!
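
The edit above mentions grepping out the lines that contain special characters. One way this might be done is sketched below; the sample data and the helper name `make_sample` are hypothetical stand-ins for the real CSV, and the pattern is an assumption rather than the command that was actually used.

```shell
# Hypothetical stand-in for a large CSV: two plain ASCII lines and one line
# containing an ISO-8859-15 byte (0xE9).
make_sample() { printf 'a;b\nc;caf\351;d\ne;f\n'; }

# In the C locale, [:print:] and [:space:] cover only ASCII characters, so
# the negated bracket expression matches lines containing any other byte.
# -n prefixes each hit with its line number; cut keeps just that number.
suspect_lines=$(make_sample | LC_ALL=C grep -a -n '[^[:print:][:space:]]' | cut -d: -f1)
echo "$suspect_lines"
```

On a real file, dropping the cut and redirecting the output to a new file gives a small sample that can be opened in an editor to identify the encoding.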

    • muru
      muru over 7 years
      Run file on it, it should tell you if the file is actually UTF-16.
    • Liam Baker
      Liam Baker over 7 years
      @muru, I ran file on it but it says it's ASCII. From what I found on the Interwebs it seems that file only scans the first few lines of the file, so I guess that the offending characters appear much later in the file.
    • wjandrea
      wjandrea over 7 years
      @LiamBaker BTW, from a quick look at some PCRE docs, I think you can make your regex easier to read by replacing things like [0-9][0-9][0-9][0-9] with [0-9]{4}, and -[0-9][0-9]-[0-9][0-9] with (-[0-9]{2}){2}.
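
A sketch of the shortened pattern suggested in the last comment. With grep, the {4} interval syntax needs -E (extended regular expressions), or escaped braces in the default basic syntax. The sample lines and the names `date_re` and `make_lines` are made up for illustration.

```shell
# One date in YYYY-MM-DD form, written with ERE intervals.
date_re='[0-9]{4}(-[0-9]{2}){2}'

# Hypothetical sample lines: the first has four dates, the second only one.
make_lines() {
  printf '2016-10-25;x;2016-10-25;y;2003-09-05;z;2003-09-03\n'
  printf '2016-10-25;only one date here\n'
}

# -E enables extended regular expressions; -c counts the lines that
# contain at least four dates separated by arbitrary text.
four_dates=$(make_lines | grep -E -c "($date_re.*){4}")
echo "$four_dates"
```

Repeating the group (date followed by .*) four times selects the same lines as writing the date pattern out four times with .* in between, since the trailing .* can match nothing.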