Matching string with a fixed number of characters using grep

Solution 1

grep's idea of a character is locale-dependent. If you're in a non-Unicode locale and you grep from a file with Unicode characters in it then the character counts won't match up. If you echo $LANG then you'll see the locale you're in.

If you set the LC_CTYPE and/or LANG environment variables to a value ending with ".UTF-8" then you will get the right behaviour:

$ cat data
étuis
letter
éééééé
$ LANG=C grep -E '^.{6}$' data
étuis
letter
$ LANG=en_US.UTF-8 grep -E '^.{6}$' data
letter
éééééé
$

You can change your locale for just a single command by assigning the variable on the same line as the command.

With this configuration, multi-byte characters are considered as single characters. If you want to exclude non-ASCII characters entirely, some of the other answers have solutions for you.
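
The locale name en_US.UTF-8 is only an example, and it is not installed on every system. As a minimal sketch, assuming the locale utility is available, the helpers below (pick_utf8_locale and six_chars are made-up names) pick the first UTF-8 locale listed by locale -a and use it for a single grep call:

```shell
# pick_utf8_locale: hypothetical helper, prints the first installed locale
# whose name ends in "UTF-8" or "utf8" (the spelling varies between systems).
pick_utf8_locale() {
  locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1
}

# six_chars: hypothetical helper, prints the lines of the given files
# (or of stdin) that are exactly six characters long, counting characters
# in a UTF-8 locale rather than in whatever locale the shell is using.
six_chars() {
  utf8_locale=$(pick_utf8_locale)
  if [ -z "$utf8_locale" ]; then
    echo "no UTF-8 locale installed" >&2
    return 2
  fi
  LC_ALL=$utf8_locale grep -E '^.{6}$' "$@"
}
```

With the data file from the example above, six_chars data should print the same two lines as the en_US.UTF-8 run.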


Note that it's still possible for things to break, or at least not do exactly what you expect, in the presence of combining characters. Your grep may treat LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301) differently than the single character LATIN SMALL LETTER E WITH ACUTE (U+00E9).
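
The two spellings look identical on screen but are different byte sequences in UTF-8, which is why a character count can differ between them. A small sketch that prints the bytes of each form (hex_bytes is a made-up name; octal escapes are used so the example does not depend on how this text was saved):

```shell
# hex_bytes: hypothetical helper, prints stdin as a run of hex digits.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
printf '\303\251' | hex_bytes      # c3a9
# Decomposed: U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
printf 'e\314\201' | hex_bytes     # 65cc81
```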

Solution 2

With GNU grep when built with PCRE support, you can do:

grep -Px '\X{6}'

While . matches a single character, \X matches an extended grapheme cluster: a base character together with any combining characters attached to it.
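
Not every grep accepts -P: it is a GNU extension, and GNU grep itself can be built without PCRE. A minimal sketch of a check to run before relying on \X (has_grep_P is a made-up name):

```shell
# has_grep_P: hypothetical helper, succeeds if this grep accepts -P.
# grep exits with a status greater than 1 on an error such as an
# unsupported option, so a successful match (status 0) shows -P works.
has_grep_P() {
  printf 'a\n' | grep -qP 'a' 2>/dev/null
}

if has_grep_P; then
  echo "grep -P is available"
else
  echo "grep -P is not available"
fi
```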

In a UTF-8 locale:

$ locale charmap
UTF-8
$ printf '\u00e9tuis\n\u00e9tudes\n' | grep -Px '\X{6}'
études
$ printf 'e\u0301tuis\ne\u0301tudes\n' | grep -Px '\X{6}'
études

In that latter études, there are 7 characters, 8 bytes, and 6 grapheme clusters.
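
The byte and character counts can be checked with wc. This is a sketch that assumes a locale named C.UTF-8 is installed; substitute any UTF-8 locale listed by locale -a. wc has no option for counting grapheme clusters, so only the first two numbers are shown:

```shell
# "e" + U+0301 COMBINING ACUTE ACCENT + "tudes", written with octal escapes.
word='e\314\201tudes'

# Bytes: wc -c counts bytes in every locale.
printf "$word" | wc -c                    # 8

# Characters: wc -m counts characters according to the locale.
# C.UTF-8 is an assumption; it is not installed on every system.
printf "$word" | LC_ALL=C.UTF-8 wc -m     # 7
```

Some wc implementations pad the number with leading spaces.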

Solution 3

Try this:

LC_ALL=C.UTF-8 grep -x '[_[:alnum:]]\{6\}' file

-x makes the pattern match whole lines only, and it is specified by POSIX (see the grep specification).

LC_ALL overrides every other locale variable. You can instead set LC_CTYPE or LANG to a UTF-8 locale to get the same behaviour, provided nothing with higher precedence is set. The order of precedence is LC_ALL, then LC_CTYPE, then LANG.
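
One way to check which variable wins on a given system is to set them to conflicting values and ask locale charmap for the resulting encoding. This is a sketch only: show_charmap is a made-up name, and the UTF-8 locale is whichever one locale -a happens to list first.

```shell
# Pick any installed UTF-8 locale (the name varies between systems).
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1)

# show_charmap: hypothetical helper. In a subshell it clears the three
# variables, applies the assignments given as arguments, and prints the
# character encoding that results.
show_charmap() (
  unset LC_ALL LC_CTYPE LANG
  export "$@"
  locale charmap
)

# LC_CTYPE takes precedence over LANG: prints UTF-8.
show_charmap LANG=C LC_CTYPE="$utf8_locale"
# LC_ALL takes precedence over LC_CTYPE: prints the C locale's charmap
# (ANSI_X3.4-1968 on glibc; other systems use a different name).
show_charmap LC_ALL=C LC_CTYPE="$utf8_locale"
```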

Solution 4

You could try something like:

grep "^[A-Za-z]\{6\}$" myfile.txt

or if the words may contain numbers too, then:

grep "^[A-Za-z0-9]\{6\}$" myfile.txt

Just add any characters to the square brackets that you want in addition to these.
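
Outside the C locale, a range such as A-Z may match more than the 26 ASCII letters, depending on the locale. As a hedged sketch, forcing LC_ALL=C for the one command keeps the ranges to plain ASCII; the bracket expression below also adds an apostrophe as an example of an extra allowed character (six_ascii and the sample words are made up for illustration):

```shell
# six_ascii: hypothetical helper, prints lines made of exactly six
# ASCII letters or apostrophes, read from the given files or from stdin.
six_ascii() {
  LC_ALL=C grep -E "^[A-Za-z']{6}\$" "$@"
}

# Sample input; \303\251 is the UTF-8 encoding of the letter e with acute.
printf "letter\n\303\251tuis\ndon'ts\nabc123\n" | six_ascii
# prints:
#   letter
#   don'ts
```

The accented word is six bytes long, but in the C locale its first two bytes are not in the bracket expression, so the line is rejected.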

Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    I am trying to find all 6-letter words using grep. I currently have this:

    grep "^.\{6\}$" myfile.txt 
    

    However, I am finding that I am also getting results such as: étuis, étude.

    I suspect it has something to do with the accents above the e in those words.

    Is there something I can do to ensure that this does not happen?

    Thanks for your help!

  • cuonglm
    cuonglm almost 10 years
    if you use ., something like wăsd's will match
  • Michael Homer
    Michael Homer almost 10 years
    ' is a character that can reasonably be part of a "string with a fixed number of characters".
  • cuonglm
    cuonglm almost 10 years
    Maybe. And you should set both LC_CTYPE and LANG; something like LC_CTYPE=en_US.UTF-8 LANG=en_US will fail. Use LC_ALL for safety.
  • cuonglm
    cuonglm almost 10 years
    It does not seem to work: echo épée | grep -Px '\X{6}' outputs épée
  • Stéphane Chazelas
    Stéphane Chazelas almost 10 years
    @Gnouc, you need to run that in a UTF-8 locale (if those é above were encoded in UTF-8).
  • cuonglm
    cuonglm almost 10 years
    Oh, my mistake. It works with UTF-8.
  • Alex
    Alex over 6 years
    This won't match the étude at all, because the ASCII character corresponding to the accent will mess up the regex.