Search for text files where two different words exist (any order, any line)
Solution 1
With GNU tools:
find . -type f -exec grep -lZ FIND {} + | xargs -r0 grep -l ME
You can do standardly:
find . -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;
But that would run up to two greps per file. To avoid running that many greps, and still be portable while allowing any character in file names, you could do:
convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}
export LC_ALL=C
find .//. -type f |
convert_to_xargs |
xargs grep -l FIND |
convert_to_xargs |
xargs grep -l ME
The idea being to convert the output of find into a format suitable for xargs (which expects a blank-separated list of words, blank meaning SPC/TAB/NL in the C locale, YMMV in other locales, where single quotes, double quotes and backslashes can escape blanks and each other).

Generally you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that occur in file names. For instance if we see:

./a
./b

we've got no way to know whether it's one file called b in a directory called a<NL>. or two files, a and b, in the current directory.
We use .//. because // cannot otherwise appear in a file path as output by find (there's no such thing as a directory with an empty name, and / is not allowed in a file name), so if we see a line that contains //, we know it's the first line of a new filename. The awk command can therefore escape all newline characters except those that precede such lines.
If we take the example above, find would output in the first case (one file):

.//a
./b

which awk escapes to:

.//a\
./b

so that xargs sees it as one argument. And in the second case (two files):

.//a
.//b

which awk leaves as is, so xargs sees two arguments.
You need LC_ALL=C so that sed, awk (and some implementations of xargs) work on arbitrary sequences of bytes (even ones that don't form valid characters in the user's locale), to simplify the blank definition to just SPC and TAB, and to avoid problems with characters whose encoding contains the encoding of backslash being interpreted differently by the different utilities.
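As a quick sanity check, the whole pipeline can be exercised in a throwaway directory. The file names below are made up for illustration; one of them deliberately contains a newline to show the escaping at work.

```shell
# Sketch: run the portable pipeline against scratch files, including
# one whose name contains a newline character.
convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'FIND\nME\n' > 'both words.txt'             # both words, name has a space
printf 'FIND only\n' > find-only.txt               # only FIND
printf 'FIND then ME\n' > "$(printf 'odd\nname')"  # newline in the file name

export LC_ALL=C
find .//. -type f |
  convert_to_xargs |
  xargs grep -l FIND |
  convert_to_xargs |
  xargs grep -l ME
# expected: lists 'both words.txt' and the odd-name file, not find-only.txt
```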
Solution 2
If the files are in a single directory and their names don't contain space, tab, newline, *, ? or [ characters, and don't start with - or ., this will get a list of files containing ME, then narrow that down to the ones that also contain FIND:
grep -l FIND `grep -l ME *`
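The same idea can be written with the POSIX $(...) form of command substitution instead of backquotes, which is easier to read and nest; the same file-name restrictions apply. The scratch files below are made up for illustration.

```shell
# Set up a throwaway directory so the glob has something to match.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'FIND\nME\n' > a.txt   # contains both words
printf 'ME\n'       > b.txt   # contains only ME

# List files containing ME, then narrow to those also containing FIND.
grep -l FIND $(grep -l ME *)
# → a.txt
```

Note that if the inner grep matches nothing, the outer grep is left with no file arguments and will read from stdin, so this is best reserved for interactive use.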
Solution 3
Or use egrep -e or grep -E like this:
find . -type f -exec egrep -le '(ME.*FIND|FIND.*ME)' {} \;
or
find . -type f -exec grep -lE '(ME.*FIND|FIND.*ME)' {} +
The + makes find (if supported) pass multiple file names as arguments to the command being -execed. This saves processes and is a lot quicker than \;, which invokes the command once per file found.

-type f matches only regular files, to avoid grepping directories.

'(ME.*FIND|FIND.*ME)' is a regular expression matching any line that contains "ME" followed by "FIND", or "FIND" followed by "ME" (the single quotes prevent the shell from interpreting the special characters).
Add a -i to the grep command to make it case-insensitive.
To only match lines where "FIND" comes before "ME", use 'FIND.*ME'.
To require spaces (1 or more, but nothing else) between the words: 'FIND +ME'
To allow spaces (0 or more, but nothing else) between the words: 'FIND *ME'
The combinations are endless with regular expressions, and provided you are interested in matching only on a row-at-a-time basis, egrep is very powerful.
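Each of these spacing variants can be checked quickly by piping a sample line into grep (the sample strings here are made up for illustration):

```shell
# 'FIND +ME' requires at least one space between the words;
# 'FIND *ME' also matches when they are adjacent.
echo 'FIND   ME' | grep -qE 'FIND +ME' && echo 'one or more spaces: match'
echo 'FINDME'    | grep -qE 'FIND *ME' && echo 'zero or more spaces: match'
echo 'FINDME'    | grep -qE 'FIND +ME' || echo 'one or more spaces: no match'
```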
Solution 4
With awk you could also run:
find . -type f -exec awk 'BEGIN{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; END{if (cx > 0 && cy > 0) print FILENAME}' {} \;
It uses cx and cy to count lines matching FIND and ME respectively. In the END block, if both counters are greater than 0, it prints the FILENAME.
This would be faster/more efficient with GNU awk:
find . -type f -exec gawk 'BEGINFILE{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; ENDFILE{if (cx > 0 && cy > 0) print FILENAME}' {} +
Solution 5
TL;DR
Note: You have to test which one is the fastest for yourself.
grep -rlzE '(TermOne.*TermTwo)|(TermTwo.*TermOne)' # GNU grep
find . -type f -exec grep -q 'TermOne' {} \; \
-exec grep -q 'TermTwo' {} \; \
-print
awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
/TermTwo/{if(p==0)p=2; if(p==1)p=3}
p==3{print FILENAME;p=0;nextfile}' ./*
One File
Because grep works line by line, there is no way to build a single regex that matches two separate strings anywhere in a file.
It is possible to search for two terms with either alternation:
grep -E '(TermOne.*TermTwo)|(TermTwo.*TermOne)' file
or lookahead:
grep -P '(?=.*TermOne)(?=.*TermTwo)' file
but only if the two terms are on the same line.
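A quick check of the lookahead form (this assumes a GNU grep built with PCRE support; the sample lines are made up):

```shell
# The lookaheads match at the start of a line when both terms occur
# anywhere in that line, in either order.
printf 'TermTwo then TermOne\n' | grep -P '(?=.*TermOne)(?=.*TermTwo)'
printf 'TermOne alone\n' | grep -P '(?=.*TermOne)(?=.*TermTwo)' || echo 'no match'
```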
It is also possible to make the whole file act as one line (provided the file doesn't contain NULs; Unix text files don't) with the GNU grep -z option:
grep -zE '(TermOne.*TermTwo)|(TermTwo.*TermOne)' file
It is not possible to use -z together with -P, so no lookahead solution is possible as of today.
The other alternative is to grep twice:

grep -q 'TermOne' file && grep -q 'TermTwo' file

The exit status of the combined command will be 0 only if both terms were found in the file.
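To test each file for both terms (anywhere in the file) without find, a minimal sketch using a shell loop over the current directory (the layout is hypothetical; the loop handles arbitrary file names):

```shell
# Print the names of regular files containing both terms (two greps per file).
for f in ./*; do
  [ -f "$f" ] || continue
  if grep -q 'TermOne' "$f" && grep -q 'TermTwo' "$f"; then
    printf '%s\n' "$f"
  fi
done
```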
Or, to use awk:
awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
/TermTwo/{if(p==0)p=2; if(p==1)p=3}
p==3{print "both terms found"; exit}' file
List Files
The first solution from above will recursively list all matching files once you add the options -r (recursive, in which case no filename argument is needed), -l (list matching filenames) and -z (treat the whole file as one line):
grep -rlzE '(TermOne.*TermTwo)|(TermTwo.*TermOne)'
Or, using find (two grep calls):
find . -type f -exec grep -q 'TermOne' {} \; \
-exec grep -q 'TermTwo' {} \; \
-print
Or, using awk (the glob will include only the PWD):
awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
/TermTwo/{if(p==0)p=2; if(p==1)p=3}
p==3{print FILENAME;p=0;nextfile}' ./*
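A quick sanity check of the awk approach in a scratch directory (the file names are made up):

```shell
# Two throwaway files: one with both terms, one with only the first.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'TermOne here\nTermTwo there\n' > both.txt
printf 'TermOne only\n' > one.txt

awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
     /TermTwo/{if(p==0)p=2; if(p==1)p=3}
     p==3{print FILENAME;p=0;nextfile}' ./*
# → ./both.txt
```

Note that nextfile is not in POSIX awk, but it is supported by gawk, mawk and BSD awk.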
Chad Harrison
Updated on September 18, 2022

Comments
-
Chad Harrison almost 2 years
I'm looking for a way to search for files in which two word instances exist in the same file. I've been using the following to perform my searches up to this point:

find . -exec grep -l "FIND ME" {} \;

The problem I'm running into is that if there isn't exactly one space between "FIND" and "ME", the search doesn't yield the file. How do I adapt the search so that it matches files in which both words "FIND" and "ME" exist, as opposed to the exact string "FIND ME"?
I'm using AIX.
-
Admin almost 9 years: Do the words exist anywhere in the file, or are they always on the same line?
-
Admin almost 7 years: An alternative, if the words are on the same line, is to use a regular expression with grep -E / egrep that describes all the patterns you are interested in (and using + instead of ; if your find has support for +).
-
Ryan B over 7 years: THIS needs more upvotes!! Far more elegant than the "accepted" answer. Worked for me.
-
razzed almost 7 years: Why not use find ... -print0 and grep --null instead?
-
Stéphane Chazelas almost 7 years: @razzed, not sure what you mean by those. grep --null (aka -Z) is used in the first one, but it is a GNU extension. -print0 (another GNU extension) would not help here.
-
Stéphane Chazelas almost 7 years: --null, -print0 and -0 are all GNU extensions. Though some of them are found in other implementations nowadays, they're still not portable and not in the POSIX or Unix standard.
-
Tim about 6 years: I should have asked why it was not necessary, as it was not seen in your command.
-
Stéphane Chazelas about 6 years: @Tim, in the first two examples, find doesn't output anything; it's grep that outputs the file list NUL-delimited. The 3rd one is intended to be portable. So it's all about using an approach that doesn't rely on NUL-delimited records, as text with NULs can't be processed portably by text utilities.
-
stolenmoment about 6 years: Do most greps not support -r? That would eliminate the find, but there might be sockets or other non-plain files in the tree being searched.
-
MattBianco about 6 years: OP uses AIX and had find in the question.
-
dave_thompson_085 about 4 years: Non-GNU awk can do the more efficient multiple-files method with the somewhat clumsier 'FNR==1{if(x&&y)print f;x=y=0} /FIND/{x=1} /ME/{y=1} {f=FILENAME} END{if(x&&y)print f}'
-
Stéphane Chazelas about 4 years: @Isaac, that does assume the files don't contain NUL characters though (-z is for NUL-delimited records, not a slurp mode). Note that it's relatively recently that GNU grep started accepting no filename with -r (to search in .). Yes, speed is not going to be great for large files with no match. The whole file (for those files that don't contain NULs) also ends up loaded in memory.
-
dotancohen over 3 years: For those on Linux or other GNU systems, this is a terrific answer. The find command can easily be changed to search in only e.g. .py files, and more grep pipes can be chained on easily.