How to print lines that only contain characters from a list in BASH?
Solution 1
grep '^[eat]*$' dictionary.txt
Explanation:
^ = marker meaning beginning of line
$ = marker meaning end of line
[abc] = character class ("match any one of these characters")
* = multiplier for character class (zero or more repetitions)
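The pattern can also be built from a shell variable. The following is only a minimal sketch: the sample word list and the variable name `letters` are assumptions, not part of the original answer. It uses `+` with `grep -E` so that empty lines are not printed, and it assumes the letters include none of the characters that are special inside a bracket expression (`]`, `^`, `-`).

```shell
# Hypothetical sample data standing in for dictionary.txt (last entry is an empty line)
dict=$(mktemp)
printf '%s\n' a aardvark act anvil ate cat eat tea '' > "$dict"

# Letters to allow; assumed to contain none of ] ^ -
letters='eat'

# -E enables extended regexps, where + means "one or more"
matches=$(grep -E "^[$letters]+\$" "$dict")
printf '%s\n' "$matches"

rm -f "$dict"
```

With this sample data the command prints a, ate, eat and tea; the empty line is skipped because `+` requires at least one character.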
Solution 2
Unfortunately, I cannot comment, otherwise I'd add to amphetamachine's answer. Anyway, with the updated condition of thousands of search characters you may want to do the following:
grep -f patterns.txt dictionary.txt
where patterns.txt contains your regexp (without surrounding slashes):
^[eat]\+$
Below is a sample session:
$ cat << EOF > dictionary.txt
> one
> two
> cat
> eat
> four
> tea
> five
> cheat
> EOF
$ cat << EOF > patterns.txt
> ^[eat]\+$
> EOF
$ grep -f patterns.txt dictionary.txt
eat
tea
$
This way you are not limited by the shell (Argument list too long). Also, you can specify multiple patterns in the file:
$ cat patterns.txt
^[eat]\+$
^five$
$ grep -f patterns.txt dictionary.txt
eat
tea
five
$
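With thousands of characters, the pattern file can be generated rather than typed by hand. This is a rough sketch under stated assumptions: the allowed characters are stored one per line in a hypothetical file called letters.txt (a name not used in the original answer), and none of them is special inside a bracket expression. It writes an extended regexp, so grep is called with -E.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: the sample dictionary from the session above and a letters file
printf '%s\n' one two cat eat four tea five cheat > "$workdir/dictionary.txt"
printf '%s\n' e a t > "$workdir/letters.txt"

# Stream the letters into a bracket expression, producing the line: ^[eat]+$
{
  printf '^['
  tr -d '\n' < "$workdir/letters.txt"
  printf ']+$\n'
} > "$workdir/patterns.txt"

result=$(grep -E -f "$workdir/patterns.txt" "$workdir/dictionary.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because the letters are streamed into the file and never passed as a command-line argument, the length of the list should not run into the argument-size limit.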
Solution 3
Try it using awk:
awk '/^[eat]*$/ { print }' dictionary.txt
I found this to be at least an order of magnitude faster than grep for more than about 7 letters. However, I don't know if you will run into the same problem with thousands of letters, as I didn't test that many.
You can even search for multiple patterns at once (this is faster than searching each pattern one at a time, since the dictionary file will be read only once). Every pattern acts as an if statement:
awk '/^[eat]*$/ { print "[eat]: " $0 } /^[cat]*$/ { print "[cat]: " $0 }' dictionary.txt
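The character set can also be passed to awk as a variable instead of being hard-coded. This is a small sketch, assuming a hypothetical awk variable named `chars` and a made-up word list. The regexp is built by string concatenation and uses `+`, so empty lines are skipped.

```shell
# Hypothetical sample data standing in for dictionary.txt
dict=$(mktemp)
printf '%s\n' a aardvark act anvil ate cat eat tea > "$dict"

# chars is a hypothetical awk variable holding the allowed letters;
# a pattern with no action prints the matching line
found=$(awk -v chars='eat' '$0 ~ ("^[" chars "]+$")' "$dict")
printf '%s\n' "$found"

rm -f "$dict"
```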
Solution 4
sed -n '/a/'p words.txt
Use this for whichever letter you need to find. If you want to find more than one letter together, simply repeat the command.
Grep also should not be used for more than the most simple/elementary of searches, IMHO. Although I normally hesitate to call any of the POSIX utilities obsolete, I do try to avoid grep. Its syntax is extremely inconsistent.
Studying this text file is also recommended. http://sed.sourceforge.net/sed1line.txt
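To print only lines made up entirely of the listed letters, as the question asks, the sed address can be anchored with a character class in the same way as the grep pattern. This is a minimal sketch with hypothetical sample data; `\{1,\}` is the POSIX basic-regexp spelling of "one or more".

```shell
# Hypothetical sample data standing in for words.txt
words=$(mktemp)
printf '%s\n' a aardvark act anvil ate cat eat tea > "$words"

# -n suppresses default output; p prints only lines matching the address
only=$(sed -n '/^[eat]\{1,\}$/p' "$words")
printf '%s\n' "$only"

rm -f "$words"
```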
Solution 5
If you want to include e.g. umlauts in the pattern but not other accented characters, set LC_ALL="C" prior to executing grep.
For example, this will give you only the candidate German words in a potential dictionary.txt file:
LC_ALL="C" grep '^[a-zA-ZäÄöÖüÜß]*$' dictionary.txt
Village
Updated on July 26, 2022
Comments
-
Village almost 2 years
I have a file called "dictionary.txt" containing a list of all possible words, e.g.:
a aardvark act anvil ate ...
How can I search this, only printing lines containing letters from a limited list, e.g., if the list contains the letters "c", "a", and "t", a search will reveal these words:
a act cat
If the letters "e", "a", and "t" are searched, only these words are found from "dictionary.txt":
a ate eat tea
The only solution I have managed is this:
- Create a list of all possible letters.
- Delete the searched letters from this list, leaving a list of letters that I do not want to search for.
- With a for loop cycling through each of those letters, delete all lines from the dictionary that contain that letter.
- Print the remaining words found in the dictionary.
This solution is very slow. Also, I need to use this code with other languages, which have thousands of possible characters, so this search method is especially slow.
How can I print only those lines from "dictionary.txt" that contain only the searched-for letters, and nothing else?
-
amphetamachine almost 10 years
Oh yeah, and it's also fine to say [cheated], which will ignore repeated letters and act like [cheatd].
-
Felix Frank almost 10 years
If this explicit form leads to "Argument list too long", it's worth mentioning that the expression can also use ranges such as [e-hk-oz]. This may be helpful if the CJK characters in question have adjacent code points.
-
konsolebox almost 10 years
This could also apply to empty lines. Better use \+ or + instead of *.
-
Tom Fenech almost 10 years
Your first suggestion could be reduced to awk '/^[eat]*$/' dictionary.txt - the {print} is implicit.
-
savanto almost 10 years
@TomFenech That's handy!
-
amphetamachine almost 10 years
@konsolebox - Since when are there empty lines in dictionary wordlist files?
-
Antek about 6 years
Using it with a real Linux dictionary file (consisting of hundreds of thousands of patterns) takes an infinite amount of time to process even a file of a few kB :)
-
galaxy about 6 years
Well, the described case of "hundreds of thousands" is simply the wrong application for this pattern.