Search for text files where two different words exist (any order, any line)


Solution 1

With GNU tools:

find . -type f  -exec grep -lZ FIND {} + | xargs -r0 grep -l ME
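Here grep -lZ prints the matching file names NUL-delimited, and xargs -r0 reads them back the same way, so any byte in a file name (newlines included) survives the trip. As a comment below notes, the same pattern extends naturally; for instance (a sketch with invented details: the '*.py' filter and the word THIRD), narrowing to one file type and chaining a third word:

find . -type f -name '*.py' -exec grep -lZ FIND {} + |
  xargs -r0 grep -lZ ME |
  xargs -r0 grep -l THIRD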

Portably/standardly, you can do:

find . -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;

But that would run up to two greps per file. To avoid running so many greps while staying portable and still allowing any character in file names, you could do:

convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

export LC_ALL=C
find .//. -type f |
  convert_to_xargs |
  xargs grep -l FIND |
  convert_to_xargs |
  xargs grep -l ME

The idea is to convert the output of find into a format suitable for xargs, which expects a blank-separated (SPC, TAB or NL in the C locale; YMMV in other locales) list of words, where single quotes, double quotes and backslashes can escape blanks and each other.
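To see that quoting format in action, here is a throwaway illustration (independent of the answer's pipeline):

printf '%s\n' "a\\ b 'c d' \"e f\"" | xargs -n1 printf '<%s>\n'
# prints <a b>, <c d> and <e f>: three words, with the blanks preserved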

Generally you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that are found in file names. For instance if we see:

./a
./b

We've got no way to know whether it's one file called b in a directory called a<NL>., or two files a and b in the current directory.
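You can reproduce that ambiguity in an empty scratch directory (throwaway demo; the names are invented):

mkdir "$(printf 'a\n.')"       # a directory literally named "a<NL>."
touch "$(printf 'a\n./b')"     # which creates the single file b inside it
find . -type f -print          # prints ./a and ./b, exactly what
                               # "touch a b" in an empty dir would give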

By using .//., we know that if we see a line that contains //, it is the first line of a new file name, because // cannot otherwise appear in a file path as output by find (there is no such thing as a directory with an empty name, and / is not allowed in a file name). So the awk command can escape every newline character except the ones that precede such lines.

If we take the example above, find would output in the first case (one file):

.//a
./b

Which awk escapes to:

.//a\
./b

So that xargs sees it as one argument. And in the second case (two files):

.//a
.//b

Which awk would leave as is, so xargs sees two arguments.

You need the LC_ALL=C so that sed, awk (and some implementations of xargs) work on arbitrary sequences of bytes (even ones that don't form valid characters in the user's locale), to simplify the definition of blank to just SPC and TAB, and to avoid problems with characters whose encoding contains the encoding of backslash being interpreted differently by the different utilities.
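As a quick sanity check of the whole pipeline (a hypothetical demo, with convert_to_xargs defined and LC_ALL=C exported as above; names and contents are invented), a file whose name contains a newline survives the round trip:

mkdir scratch && cd scratch
printf 'FIND\nME\n' > "$(printf 'x\ny')"   # one file, a newline in its name
find .//. -type f |
  convert_to_xargs | xargs grep -l FIND |
  convert_to_xargs | xargs grep -l ME      # prints the single name (on two lines)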

Solution 2

If the files are in a single directory and their names don't contain space, tab, newline, *, ? or [ characters and don't start with - or ., this will get a list of files containing ME, then narrow that down to the ones that also contain FIND.

grep -l FIND `grep -l ME *`
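With exactly the same caveats, the backquotes can be written in the modern $( ) form; note also that if no file contains ME, the outer grep is left with no file arguments and will sit reading its standard input:

grep -l FIND $(grep -l ME *)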

Solution 3

Or use egrep -e or grep -E like this:

find . -type f -exec egrep -le '(ME.*FIND|FIND.*ME)' {} \;

or

find . -type f -exec grep -lE '(ME.*FIND|FIND.*ME)' {} +

The + makes find (if supported) pass multiple file names as arguments to the command being -execed. This saves processes and is a lot quicker than \;, which invokes the command once for each file found.
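A rough way to see the difference for yourself (illustrative only; each sh -c invocation prints one line, so the line counts approximate the number of processes started):

find . -type f -exec sh -c 'echo run' sh {} \; | wc -l   # one invocation per file
find . -type f -exec sh -c 'echo run' sh {} + | wc -l    # one per (large) batch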

-type f matches only regular files, to avoid running grep on directories.

'(ME.*FIND|FIND.*ME)' is a regular expression matching any line containing "ME" followed by "FIND" or "FIND" followed by "ME". (single quotes to prevent the shell from interpreting special characters).

Add a -i to the grep command to make it case-insensitive.

To only match lines where "FIND" comes before "ME", use 'FIND.*ME'.

To require spaces (1 or more, but nothing else) between the words: 'FIND +ME'

To allow spaces (0 or more, but nothing else) between the words: 'FIND *ME'

The combinations are endless with regular expressions, and provided you are interested in matching on a line-at-a-time basis, egrep is very powerful.
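A tiny worked example of those variants (hypothetical file contents):

printf 'FIND the ME\nME then FIND\nFINDME\n' > sample.txt   # invented test file
grep -E '(ME.*FIND|FIND.*ME)' sample.txt   # matches all three lines
grep -E 'FIND.*ME' sample.txt              # lines 1 and 3 only
grep -E 'FIND +ME' sample.txt              # no match: a word sits between them
grep -E 'FIND *ME' sample.txt              # line 3 only (zero spaces between)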

Solution 4

With awk you could also run:

find . -type f  -exec awk 'BEGIN{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; END{if (cx > 0 && cy > 0) print FILENAME}' {} \;

It uses cx and cy to count the lines matching FIND and ME respectively. In the END block, if both counters are > 0, it prints the FILENAME.
This would be faster/more efficient with GNU awk:

find . -type f  -exec gawk 'BEGINFILE{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; ENDFILE{if (cx > 0 && cy > 0) print FILENAME}' {} +
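Without GNU awk, the same per-file reset can be written by hand at the first line of each file, as one of the comments below also shows (a sketch for any POSIX awk):

find . -type f -exec awk '
  FNR == 1 { if (cx && cy) print f; cx = cy = 0 }   # new file: report the previous one
  /FIND/   { cx++ }
  /ME/     { cy++ }
  { f = FILENAME }
  END      { if (cx && cy) print f }                # report the last file
' {} +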

Solution 5

TL;DR

Note: You have to test which one is the fastest for yourself.

grep -rlzE '(TermOne.*TermTwo)|(TermTwo.*TermOne)'    # GNU grep

find . -type f -exec grep -q 'TermOne' {} \; \
               -exec grep -q 'TermTwo' {} \; \
               -print

awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
     /TermTwo/{if(p==0)p=2; if(p==1)p=3}
     p==3{print FILENAME;p=0;nextfile}' ./*

One File

A regex match covers one contiguous span of text, so there is no way to build a single regex that matches two separate strings in a file independently of their order and position.

It is possible to search for two terms with either alternation:

grep -E '(TermOne.*TermTwo)|(TermTwo.*TermOne)' file

or lookahead:

grep -P '(?=.*TermOne)(?=.*TermTwo)' file

but only if the two terms are on the same line.

It is also possible to make the whole file act as one line (if the file doesn't contain NULs; Unix text files don't) with the GNU grep -z option:

grep -zE '(TermOne.*TermTwo)|(TermTwo.*TermOne)' file

It is not possible to use -z together with -P, so no lookahead solution is possible as of today.

The other alternative is to grep twice:

grep -q 'TermOne' file && grep -q 'TermTwo' file

The combined exit status will be 0 only if both terms were found somewhere in the file. (Note that a pipeline like <file grep 'TermOne' | grep -q 'TermTwo' would only succeed when both terms appear on the same line.)
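That exit status makes it easy to use in a conditional; for example (file is a placeholder name):

if grep -q 'TermOne' file && grep -q 'TermTwo' file; then
  echo 'both terms found'
fi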

Or, to use awk:

awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
     /TermTwo/{if(p==0)p=2; if(p==1)p=3}
     p==3{print "both terms found"; exit}' file

List Files

The first two solutions from above will work to recursively list all matching files, by adding the options -r (recursive; with no file name argument the search starts in the current directory), -l (list matching file names) and -z (treat the whole file as one line).

grep -rlzE '(TermOne.*TermTwo)|(TermTwo.*TermOne)'

Or, using find (two grep calls):

find . -type f -exec grep -q 'TermOne' {} \; \
               -exec grep -q 'TermTwo' {} \; \
               -print

Or, using awk (the glob covers only the current directory):

awk '/TermOne/{if(p==0)p=1; if(p==2)p=3}
     /TermTwo/{if(p==0)p=2; if(p==1)p=3}
     p==3{print FILENAME;p=0;nextfile}' ./*

Comments

  • Chad Harrison
    Chad Harrison almost 2 years

    I'm looking for a way to search files where two word instances exist in the same file. I've been using the following to perform my searches up to this point:

    find . -exec grep -l "FIND ME" {} \;
    

    The problem I'm running into is that if there isn't exactly one space between "FIND" and "ME", the search does not yield the file. How do I adapt the search so that it matches files where both words "FIND" and "ME" exist, as opposed to the exact phrase "FIND ME"?

    I'm using AIX.

    • Admin
      Admin almost 9 years
      Do the words exist anywhere in the file, or are they always on the same line?
    • Admin
      Admin almost 7 years
      An alternative, if the words are on the same line, is to use a regular expression with grep -E / egrep that describes all the patterns you are interested in (and to use + instead of ; if your find has support for +).
  • Ryan B
    Ryan B over 7 years
    THIS needs more upvotes!! Far more elegant than the "accepted" answer. Worked for me.
  • razzed
    razzed almost 7 years
    Why not use find ... -print0 and grep --null instead?
  • Stéphane Chazelas
    Stéphane Chazelas almost 7 years
    @razzed, not sure what you mean by those. grep --null (aka -Z) is used in the first one but is a GNU extension. -print0 (another GNU extension) would not help here.
  • Stéphane Chazelas
    Stéphane Chazelas almost 7 years
    --null, --print0, -0 are all GNU extensions. Though some of them are found in other implementations nowadays, they're still not portable and not in the POSIX or Unix standard.
  • Tim
    Tim about 6 years
    I should have asked why it was not necessary as it was not seen in your command.
  • Stéphane Chazelas
    Stéphane Chazelas about 6 years
    @Tim, In the first two examples, find doesn't output anything; it's grep that outputs the file list NUL-delimited. The 3rd one is intended to be portable. So it's all about using an approach that doesn't use NUL-delimited records, as text with NULs can't be processed portably by text utilities.
  • stolenmoment
    stolenmoment about 6 years
    Do most greps not support "-r"? That would eliminate the "find", but there might be sockets or other non-plain files in the tree being searched.
  • MattBianco
    MattBianco about 6 years
    OP uses AIX and had find in the question.
  • dave_thompson_085
    dave_thompson_085 about 4 years
    Non-GNU awk can do the more efficient multiple-files method with the somewhat clumsier 'FNR==1{if(x&&y)print f;x=y=0} /FIND/{x=1} /ME/{y=1} {f=FILENAME} END{if(x&&y)print f}'
  • Stéphane Chazelas
    Stéphane Chazelas about 4 years
    @Isaac, that does assume the files don't contain NUL characters though (-z is for NUL-delimited records, not a slurp mode). Note that it's relatively recently that GNU grep started accepting no filename with -r (to search in .). Yes, speed is not going to be great for large files with no match. The whole file (for those files that don't contain NULs) also ends up loaded in memory.
  • dotancohen
    dotancohen over 3 years
    For those on Linux or other GNU systems, this is a terrific answer. The find command can easily be changed to search in only e.g. .py files, and more grep pipes can be chained on easily.