How to print lines that only contain characters from a list in BASH?

Solution 1

grep '^[eat]*$' dictionary.txt

Explanation:

^ = marker meaning beginning of line

$ = marker meaning end of line

[abc] = character class ("match any one of these characters")

* = quantifier applied to the character class (zero or more repetitions, so an empty line also matches)
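
As a rough illustration of the anchors and the quantifier, the sketch below runs the same search against a small made-up word list (the sample words and the `letters` variable are assumptions, not part of the original answer). It also shows that `*` accepts an empty line, while `+` under `grep -E` does not:

```shell
# Hypothetical sample data; a real dictionary.txt would be much larger.
dict=$(mktemp)
printf '%s\n' a act ate cat eat tea cheat '' > "$dict"

# Assumed to hold no characters that are special inside [...], such as ] ^ - or \
letters='eat'

# With *, zero repetitions count as a match, so the empty line is counted too.
with_star=$(grep -c "^[${letters}]*\$" "$dict")

# -E enables +, which requires at least one character from the class.
matches=$(grep -E "^[${letters}]+\$" "$dict")

printf 'lines matched with *: %s\n' "$with_star"
printf '%s\n' "$matches"
rm -f "$dict"
```

If the word list is known to contain no empty lines, the two forms behave the same.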

Solution 2

Unfortunately, I cannot comment, otherwise I'd add to amphetamachine's answer. Anyway, with the updated condition of thousands of search characters you may want to do the following:

grep -f patterns.txt dictionary.txt

where patterns.txt contains your regexp, one pattern per line and without surrounding slashes:

^[eat]\+$

Below is a sample session:

$ cat << EOF > dictionary.txt
> one
> two
> cat
> eat
> four
> tea
> five
> cheat
> EOF
$ cat << EOF > patterns.txt
> ^[eat]\+$
> EOF
$ grep -f patterns.txt dictionary.txt
eat
tea
$

This way you are not limited by the shell (Argument list too long). Also, you can specify multiple patterns in the file:

$ cat patterns.txt
^[eat]\+$
^five$
$ grep -f patterns.txt dictionary.txt
eat
tea
five
$
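
With thousands of characters, the pattern file would probably be generated rather than typed. The following is a minimal sketch under stated assumptions: the allowed characters are stored one per line in a hypothetical letters.txt, none of them is special inside a bracket expression (such as `]`, `^`, `-` or `\`), and `grep -E` is used so that `+` needs no backslash:

```shell
work=$(mktemp -d)
# Hypothetical input: one allowed character per line.
printf '%s\n' e a t > "$work/letters.txt"
printf '%s\n' one two cat eat four tea five cheat > "$work/dictionary.txt"

# Join the characters into a single bracket expression and write it
# as one pattern line, so the pattern never passes through the argument list.
printf '^[%s]+$\n' "$(tr -d '\n' < "$work/letters.txt")" > "$work/patterns.txt"

pattern=$(cat "$work/patterns.txt")
result=$(grep -E -f "$work/patterns.txt" "$work/dictionary.txt")

printf '%s\n' "$pattern" "$result"
rm -rf "$work"
```

Only the short `printf` format string and file names appear on a command line here; the long character list travels through files and pipes.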

Solution 3

Try it using awk:

awk '/^[eat]*$/ { print }' dictionary.txt

I found this to be at least an order of magnitude faster than grep for more than about 7 letters. However, I don't know if you will run into the same problem with thousands of letters, as I didn't test that many.

You can even search for multiple patterns at once (this is faster than searching each pattern one at a time, since the dictionary file will be read only once). Every pattern acts as an if statement:

awk '/^[eat]*$/ { print "[eat]: " $0 } /^[cat]*$/ { print "[cat]: " $0 }' dictionary.txt
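
If the letters change between runs, one option is to pass them to awk as a variable and build the regular expression dynamically. This is only a sketch: the `letters` and `re` names and the sample words are made up for illustration, and it assumes the letters contain no backslashes and nothing special inside a bracket expression:

```shell
dict=$(mktemp)
# Hypothetical sample data standing in for dictionary.txt.
printf '%s\n' a act ate cat eat tea cheat > "$dict"

# Build the regex once in BEGIN, then print every line that matches it.
# The pattern-only rule relies on awk's implicit { print }.
result=$(awk -v letters='eat' 'BEGIN { re = "^[" letters "]+$" } $0 ~ re' "$dict")

printf '%s\n' "$result"
rm -f "$dict"
```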

Solution 4

sed -n '/a/'p words.txt

Use this for whichever letter you need to find; it prints every line that contains that letter. If you want to find more than one letter together, simply repeat the command for each letter. Note that this selects lines containing the letter, not lines made up only of the listed letters.

Grep also should not be used for more than the most simple/elementary of searches, IMHO. Although I normally hesitate to call any of the POSIX utilities obsolete, I do try to avoid grep. Its syntax is extremely inconsistent.

Studying this text file is also recommended: http://sed.sourceforge.net/sed1line.txt

Solution 5

If you want to include, e.g., umlauts in the pattern but exclude other accented characters, set LC_ALL="C" before running grep.

For example, this prints only the candidate German words from a dictionary.txt file:

LC_ALL="C" grep '^[a-zA-ZäÄöÖüÜß]*$' dictionary.txt
Author by

Village

Updated on July 26, 2022

Comments

  • Village
    Village almost 2 years

    I have a file called "dictionary.txt" containing a list of all possible words, e.g.:

    a
    aardvark
    act
    anvil
    ate
    ...
    

    How can I search this, only printing lines containing letters from a limited list, e.g., if the list contains the letters "c", "a", and "t", a search will reveal these words:

    a
    act
    cat
    

    If the letters "e", "a", and "t" are searched, only these words are found from "dictionary.txt":

    a
    ate
    eat
    tea
    

    The only solution I have managed is this:

    • Create a list of all possible letters.
    • Delete the searched letters from this list, leaving a list of letters that I do not want to search for.
    • With a for loop cycling through each of those letters, delete all lines from the dictionary that contain those letters.
    • Print the remaining words found in the dictionary.

    This solution is very slow. Also, I need to use this code with other languages, which have thousands of possible characters, so this search method is especially slow.

    How can I print only those lines from "dictionary.txt" that contain only the searched-for letters, and nothing else?

  • amphetamachine
    amphetamachine almost 10 years
    Oh yeah, and it's also fine to say [cheated] which will ignore repeated letters and act like [cheatd]
  • Felix Frank
    Felix Frank almost 10 years
    If this explicit form leads to Argument list too long, it's worth mentioning that the expression can also use ranges such as [e-hk-oz]. This may be helpful if the CJK characters in question have adjacent code points.
  • konsolebox
    konsolebox almost 10 years
    This could also apply to empty lines. Better use \+ or + instead of *.
  • Tom Fenech
    Tom Fenech almost 10 years
    Your first suggestion could be reduced to awk '/^[eat]*$/' dictionary.txt - the {print} is implicit.
  • savanto
    savanto almost 10 years
    @TomFenech That's handy!
  • amphetamachine
    amphetamachine almost 10 years
    @konsolebox - Since when are there empty lines in dictionary wordlist files?
  • Antek
    Antek about 6 years
    Using it with a real Linux dictionary file (consisting of hundreds of thousands of patterns) takes an infinite amount of time to process even a file of a few kB :)
  • galaxy
    galaxy about 6 years
    Well, the described case "hundreds of thousands" is just a wrong application for this pattern.