Check if all of multiple strings or regexes exist in a file


Solution 1

Awk is the tool that the people who invented grep, shell, etc. created to do general text-manipulation jobs like this, so it's not clear why you'd want to try to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have been matched.
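For instance, a minimal usage sketch (assuming the awk program between the quotes above is saved in a file named allstrings.awk, a hypothetical name):

if awk -v strings='string1 string2 string3' -f allstrings.awk file; then
    echo "all strings were found"
else
    echo "at least one string is missing"
fi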

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above two GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas the first awk script above will work in any awk in any shell on any UNIX box and only stores one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending to a variable one line at a time as it's read and then processing that variable in the END section.
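For instance, here is a minimal sketch of that last alternative (my illustration of the description above, not part of the original scripts): append each line to a variable and test the accumulated text once in the END section. It runs in any POSIX awk but, like the RS-based versions, ends up holding the whole file in memory:

awk '
NR==FNR { strings[$0]; next }          # first file: collect the search strings
{ all = all $0 "\n" }                  # second file: accumulate its lines
END {
    for (string in strings)
        if ( !index(all, string) )     # any string missing -> failure
            exit 1
}
' file_of_strings file_to_be_searched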

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched

Solution 2

git grep

Here is the syntax using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file

You may also combine patterns with Boolean expressions such as --and, --or and --not.
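Note that --and constrains the patterns to match on the same line, unlike --all-match, which only requires that each pattern match somewhere in the file. A sketch (the patterns here are placeholders):

git grep --no-index -l -e string1 --and \( -e string2 --or -e string3 \) file

This lists the file only if some line contains string1 together with string2 or string3.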

Check man git-grep for help.


--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other params to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and others.
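With thousands of strings, a pattern file is more practical than repeating -e. A sketch combining the options above (assuming strings.txt contains one literal string per line):

git grep -q --all-match --no-index -F -f strings.txt file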

Solution 3

This gnu-awk script may work:

cat fileSearch.awk
re == "" {
   exit
}
{
   split($0, null, "\\<(" re "\\>)", b)
   for (i=1; i<=length(b); i++)
      gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
   exit (re != "")
}

Then use it as:

if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
   echo "all strings were found"
else
   echo "all strings were not found"
fi

Alternatively, you can use this gnu grep solution with PCRE option:

grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file
  • Using -z, we make grep read the complete file as a single string.
  • We use multiple lookahead assertions to assert that all of the strings are present somewhere in the file.
  • The regex must use (?s), the DOTALL modifier, so that .* can match across lines.

As per man grep:

-z, --null-data
   Treat  input  and  output  data as sequences of lines, each terminated by a 
   zero byte (the ASCII NUL character) instead of a newline.
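A usage sketch, branching on grep's exit status (the strings are placeholders):

if grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file; then
    echo "all strings were found"
fi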

Solution 4

First, you probably want to use awk. Since you eliminated that option in the question statement: yes, it is possible, and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...

This is based on the following assumptions:

  • Invoking AWK is unacceptable
  • Invoking grep multiple times is unacceptable
  • The use of any other external tools is unacceptable
  • Invoking grep less than once is acceptable
  • It must return success if everything is found, failure when not
  • Using bash instead of external tools is acceptable
  • bash version is >= 3 for the regular expression version

This might meet all of your requirements (the regex version is missing some comments; look at the string version instead):

#!/bin/bash

multimatch() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    strings=( "$@" ) # search strings into an array

    declare -a matches # Array to keep track which strings already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                string="${strings[$i]}" # fetch the string
                if [[ $line = *$string* ]]; then # check if it matches
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

multimatch_regex() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    regexes=( "$@" ) # Regexes into an array

    declare -a matches # Array to keep track which regexes already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#regexes[@]};i++)); do # Loop through regex indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                regex="${regexes[$i]}" # Get regex from array
                if [[ $line =~ $regex ]]; then # We use the bash regex operator here
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

if multimatch "filename" string1 string2 string3; then
    echo "file has all strings"
else
    echo "file miss one or more strings"
fi

if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
    echo "file match all regular expressions"
else
    echo "file does not match all regular expressions"
fi

Benchmarks

I did some benchmarking, searching the .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added and the code tuned so that each solution can be called as testname <filename> <searchstring> [...] and so that an if can be used to check the result.)
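A hypothetical harness matching that description - testname stands in for each wrapped solution and is my assumption, not part of the benchmarked code:

time find arch/arm \( -name '*.c' -o -name '*.h' -o -name '*.sh' \) -print |
while IFS= read -r f; do
    testname "$f" void function '#define' >/dev/null
done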

Results (measured with time, real time rounded to the closest half second):

(Invoking grep multiple times, especially with the recursive method, did better than I expected)

Solution 5

A recursive solution. Iterate over the files one by one. For each file, check whether it matches the first pattern and stop at the first match (-m1); only if it matched the first pattern, search for the second pattern, and so on:

#!/bin/bash

patterns="$@"

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
  fi
}

for file in *
do
  test -f "$file" && fileMatchesAllNames "$file" $patterns
done

Usage:

./allfilter.sh cat filter java
test.sh

It searches the current dir for the tokens "cat", "filter" and "java", and found them only in "test.sh".

So grep is invoked often in the worst-case scenario (finding the first N-1 patterns in the last line of each file, except for the N-th pattern).

But with an informed ordering (rarely matching first, early matches first) where possible, the solution should be reasonably fast, since many files are abandoned early because they didn't match the first keyword, or accepted early because they matched a keyword close to the top.

Example: You search a scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position); you would order them:

./allfilter.sh mutable tailrec main println 

Performance:

ls *.scala | wc 
 89      89    2030

In the 89 scala files, the keywords have this distribution:

for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done 
16
34
41
71

Searching them with a slightly modified version of the script, which allows a file pattern to be used as the first argument, takes about 0.2s:

time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala    Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala

real    0m0.216s
user    0m0.024s
sys 0m0.028s

in close to 15,000 lines of code:

cat *.scala | wc 
  14913   81614  610893

Update:

After reading in the comments to the question that we might be talking about thousands of patterns, handing them over as arguments doesn't seem to be a clever idea; better to read them from a file and pass the filename as an argument - maybe for the list of files to filter, too:

#!/bin/bash

filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
  fi
}

echo -e "Filepattern: $filepattern\tPatterns: $patterns"
for file in $(< $filelist)
do
  test -f "$file" && fileMatchesAllNames "$file" $patterns
done

If the number and length of patterns/files exceeds the limits of argument passing, the list of patterns could be split into several pattern files and processed in a loop (for example, with 20 pattern files):

for i in {1..20}
do
   ./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done

Comments

  • codeforester
    codeforester about 2 years

    I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:

    ...
    string1
    ...
    string2
    ...
    string3
    ...
    string1 string2
    ...
    string1 string2 string3
    ...
    string3 string1 string2
    ...
    string2 string3
    ... and so on
    

    In the above example, we could have regexes in place of strings.

    For example, the following code checks if any of my strings exists in the file:

    if grep -EFq "string1|string2|string3" file; then
      # there is at least one match
    fi
    

How to check if all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.

Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or if we have a large number of strings to match) or use a tool like awk or python?

    Also, is there a solution for strings that can easily be extended for regexes?

    • Ian McGowan
      Ian McGowan about 6 years
    • codeforester
      codeforester about 6 years
      @IanMcGowan: that's not a dupe. That post talks about the all patterns occurring together on the same line. My interest is to find them on the same line or on different lines.
    • Ian McGowan
      Ian McGowan about 6 years
      Ah, good point, my bad. I should have added a question mark.
    • user unknown
      user unknown about 6 years
      Do you have some typical, practical numbers in mind like numbers of patterns (4, 16, 64, ...) and files to search (10, 1000, 100000)?
    • codeforester
      codeforester about 6 years
      @userunknown: up to a few thousand patterns is reasonable.
    • user unknown
      user unknown about 6 years
@codeforester: Wow, a few thousand, that seems like important information. For such big numbers, it might get problematic to pass them as arguments via the shell. Maybe it is possible to split them into chunks of some hundreds of parameters, filter a filelist and get a filtered filelist from the first invocation, to produce a reduced filelist as input for the second invocation with the next chunk and so on. Instead of passing the files and patterns as parameters, one could pass filenames containing these; with few modifications this would be possible with my script, but surely with other scripts too.
    • glerYbo
      glerYbo about 6 years
    • Ed Morton
      Ed Morton about 6 years
      I'm completely aghast at how many times I see the word "pattern" in this Q&A and all of the answers that search for regexps using the word "string" instead. Saying you want to find a "pattern" is like losing your dog and putting up flyers that just say "lost: 1 animal" and saying you're working with a string when it's really a regexp is like handing someone a cat and trying to convince them it's really their missing dog!
    • Gert van den Berg
      Gert van den Berg about 6 years
      Why does this have an awk and python tag if it asks about doing it without them?
  • learningbee
    learningbee about 6 years
    This would work, however, OP says he doesn't want to call grep multiple times.
  • Ian McGowan
    Ian McGowan about 6 years
    Acknowledged, but I don't think there's another way to do it with plain grep, and at least the -q and && will shortcut some of the multiple passes over the data files.
  • anubhava
    anubhava about 6 years
    No grep solution works irrespective of any order of appearance of those strings in the file. Lookahead assertions just make sure those strings are present anywhere in the file. So grep can also be written as: grep -qzP '(?s)(?=.*\bstring3\b)(?=.*\bstring1\b)(?=.*\bstring2\b)' file
  • anubhava
    anubhava about 6 years
    Indeed, your gnu-awk solution is neat
  • codeforester
    codeforester about 6 years
    This is a good answer. Started a bounty to attract even more attention.
  • Ed Morton
    Ed Morton about 6 years
    Ok but you really should clarify if you want a solution for strings or regexps and whether you need "full word" matches only or partial matches.
  • Leon
    Leon about 6 years
    uniq eliminates only adjacent duplicates. That's why I used sort -u in my answer.
  • Leon
    Leon about 6 years
    OP is looking for a solution that doesn't require invoking grep multiple times (per file).
  • user unknown
    user unknown about 6 years
@Leon: That's a question of whether it is possible, and at least for all files which don't match a former pattern, grep is not invoked any longer, and for each match it stops early, so with an informed ordering (rarely matching first, early matches first) if possible, the solution should be reasonably fast.
  • Anna
    Anna about 6 years
    @Leon yes, absolutely makes sense. Thank you for correcting me!
  • Ed Morton
    Ed Morton about 6 years
    Good. Really think long and hard about what each solution is doing as your posted sample input doesn't cover all of the cases it needs to to prove if a solution works or not so while all solutions will produce the expected output given that sample input, most of them will fail given different input. So either take your time to figure out different sample input that covers all of your requirements (in particular files that should not match but are hard to handle) or really think through the logic of each solution to make sure it does address all of your needs.
  • Ed Morton
    Ed Morton about 6 years
Oh, and also make sure you understand why-is-using-a-shell-loop-to-process-text-considered-bad-practice if you're considering any solution that involves a shell loop, and consider what the given solution will do wrt memory and speed of execution for a huge file.
  • Gert van den Berg
    Gert van den Berg about 6 years
    Some benchmarking (say on the scala file example) would be interesting... It is likely significantly slower than AWK - it is mainly an exercise to show that the stated requirements can be met... (Speed was not mentioned as a requirement) (This also seems to use no external processes - that can be good for speed, but bash text processing, which can be bad for speed, at least compared to C code in grep...)
  • kvantour
    kvantour about 6 years
    @EdMorton Very nice answer! I am missing only one thing, what if the searched string is split by a newline. Imagine a textbook and the string you want to search for is broken apart by a newline. It is a bit more difficult but it could be a nice addition to this answer!
  • user unknown
    user unknown about 6 years
This reports multiple matches per file. But with time git grep --name-only --all-match -e ... it only reports the matching files and it is impressively fast - my solution took about 0.2s for my testcase, yours took 0.02s. And the option -f patternfile might be interesting too. Most important maybe: It is simple, clear and compact and will surely get more support over time than my homegrown script.
  • Ed Morton
    Ed Morton about 6 years
    Thanks, but there's nothing special about a newline, you just treat it like any other character. If the OP wants to search for a string that contains a newline then they just have to set the RS to whatever is the record separator instead of a newline (e.g. \0 or ^$ as I show above to search the whole file or maybe "" if blank lines separate the records or [[:punct:]] or something else) and include a newline in the string that's being searched with, just like any other character.
  • Ed Morton
    Ed Morton about 6 years
    That script will not look for any strings, it will look for 4 regexps. If you want it to act as if it were looking for strings then see stackoverflow.com/q/29613304/1745001 for how to do that.
  • Ed Morton
    Ed Morton about 6 years
Yes, you're right, it would be several orders of magnitude slower than an awk script. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice. You use the word pattern in several places in your script - that's a highly ambiguous term which as such should generally be avoided so please clarify each time you use it if you are talking about a string, a regexp, a globbing pattern or something else.
  • Gert van den Berg
    Gert van den Berg about 6 years
    @EdMorton: It seems to be the only way to avoid standard tools (like AWK) (or non-standard tools, like Perl) (I'm working on the assumption that the only acceptable non-shell tool is grep, invoked only once...) (The question asks whether it is "possible", not whether it is a good idea ;-) ) "Pattern" was used for the search string (which can either be a regex or a plain string matched with globbing, depending on the variant) (The regex version was done once I realised that bash had built-in regexes, and is a simple modification of the first version) (detailed comments was added after fork)
  • Gert van den Berg
    Gert van den Berg about 6 years
    @EdMorton: A lot (but not all) of the criticism of the read method is invoking external tools multiple times, which this avoids. (It does need a blank IFS to keep it from stripping leading whitespace (I know that with a single variable, whitespace between terms are not affected by $IFS))
  • Gert van den Berg
    Gert van den Berg about 6 years
    (assuming "use" should be "using" in the quoted part not "to instead use")
  • kvantour
    kvantour about 6 years
    And this is why SO is so great. Everybody has different ways of approaching a problem and from time to time something unexpected pops up.
  • codeforester
    codeforester about 6 years
    Instead of using awk to create the expression for grep, we could definitely use grep -f which would work much better when we have a lot of strings to search for. That way, it would be easier to handle any special characters in the strings.
  • Ed Morton
    Ed Morton about 6 years
Ignoring the UUOCs, this searches for regexps, not strings. You're using awk to create a regexp like regexp1|regexp2|... that gets saved in the badly-named variable STR and then you're using grep -o to find it. The grep -o MUST do a regexp match since you're relying on the |s in STR so every part of STR is treated as a regexp and there's nothing you can do to fix that. It will also fail in several ways, e.g. when one "string" is part of another, e.g. try $ echo 'footherebar' | grep -E -o 'the|there' and you'll find the output is just there, not the and there as required.
  • Ed Morton
    Ed Morton about 6 years
    The OP tells us she has thousands of strings to search for and so hard-coding them all as in your first script won't be practical (plus it's searching for regexps, not strings). Your second script is similar to mine in terms of using a counter but instead of reducing the array of strings on every loop like mine does, yours is creating a duplicate array of strings and so will be slower than mine and use twice as much memory. It wasn't me who downvoted btw.
  • revo
    revo about 6 years
Where am I creating a duplicate array of strings? And I don't get your "it's searching for regexps, not strings", since whatever the input is, a pattern or a literal string, it doesn't confuse them. @EdMorton
  • Ed Morton
    Ed Morton about 6 years
In your first script /string/ is defining a regexp literal with the contents string and then doing a regexp comparison of that against $0, i.e. it's searching for a regexp not a string. In your second script you have an array args[] that contains all of your strings and each time you match a string in the input you're adding it to the array temp[] so when all strings are found in the input you end up with temp[] being a duplicate of args[].
  • Ed Morton
    Ed Morton about 6 years
Oh, hang on - I misread your second script. I assumed you were using args[] to contain the arg strings as indices so you could do a simple loop on the indices and a string comparison like I'm doing, but you aren't: you're storing the strings as array contents and looping on numeric indices and then dereferencing the array each time and doing a regexp comparison! So arg in your second script isn't actually an arg (one of the strings passed in), it's an index, and the arg/string is actually at args[arg], so you aren't creating a dup in temp[] but there are other issues instead.
  • revo
    revo about 6 years
I guessed you did, probably on the if condition... but let's talk about strings vs regexps. I see it as one single solution for both. You provided two individually and I'm not sure how it acts inefficiently on strings vs regexps. Maybe a benchmark is needed? @EdMorton
  • Ed Morton
    Ed Morton about 6 years
    Try both solutions when the string being searched for is .*. Mine will find the exact string .* (i.e.a period followed by an asterisk) if and only if it exists in the input file while yours will match on the first line no matter what the input file contents since .* will be treated as the regexp "zero or more repetitions of any character". wrt efficiency, I'm not considering the efficiency of regexp vs string comparisons, I'm removing each string from my array when it matches so the array gets smaller and so faster to loop through on every match. Feel free to do timing tests if you like.
  • revo
    revo about 6 years
Isn't it dependent on the OP? In the case of .* you are right, but if the OP doesn't have quantifiers or regex metacharacters in his / her arguments then what would be the issue? And since a line is read at a time, even .*, which matches the wrong substring, doesn't differ performance-wise from matching a literal string like s - even if the whole input string is one single line by means of \n being the RS. @EdMorton
  • Ed Morton
    Ed Morton about 6 years
    Right, if the OP chooses not to have any regexp metacharacters then there are no false matches produced by regexp metacharacters but why impose that restriction when you can just as easily do string comparisons? The OP said she wants to find strings so why then say 'ok but you can't simply look for "foo.c" in a list of file names', etc.. Again, when I'm talking about performance I'm not talking about the difference in a string vs regexp comparison, I'm talking about reducing the number of values being looped on for every time a value is found vs keeping the number of iterations constant.
  • Gert van den Berg
    Gert van den Berg about 6 years
    The new version, can probably deal with plain strings (by adding a -F to the grep) (and without -F or with -G, standard regexes) (And can deal with extended regexes with -E or other non-standard variants, like -P for PCRE regexes in GNU grep)
  • Ed Morton
    Ed Morton about 6 years
    It still can't handle strings being contained in other strings though. Again, try any variation of echo 'footherebar' | grep -E -o 'the|there' and it'll find there but not the.
  • Ed Morton
    Ed Morton about 6 years
    Best I can tell git grep -q --all-match --no-index -l -F -f strings.txt file seems like it'd work just fine as long as you don't mind installing git tools. Does it read the whole file into memory at one time or does it store 1 line at a time in memory? Is there any limit on how many strings can be searched for at one time?
  • glerYbo
    glerYbo about 6 years
@EdMorton There are no string limits, it'll just take longer. In terms of memory management, I'm not sure; you can check the source code of grep.c (look for all_match). But overall, it's designed to be quick.
  • Ed Morton
    Ed Morton about 6 years
    Thanks for doing the benchmarking. Since the OP is searching for thousands of strings could you try again with a large number of strings (e.g. at least 1,000, all of which appear in the target file,and some of which are subsets of each other and some of which contain regexp metachars)? There's huge differences between how the various solutions will perform as the number of strings to be searched for gets large (and match) plus some solutions will fail given strings that are substrings of other strings or strings that contain RE chars and those differences won't show up for just those 3 strings
  • Ed Morton
    Ed Morton about 6 years
    Thanks, best I can tell from reading that source file it always reads the whole input file into memory.
  • Gautam
    Gautam about 6 years
Yes. Strangely, if the contents of my strings.txt is "there\nthe", it matches both the strings, whereas if it is "the\nthere" it matches only one. :P I have modified my answer to say it isn't able to, though.
  • Gert van den Berg
    Gert van den Berg about 6 years
    @EdMorton: It becomes tricky - many of the solutions have different interfaces that becomes tricky (and possibly slow) to map. The CLI ones might need a non-shell method to invoke them to build a argv close to ARG_MAX (Although they can be arbitrarily combined if they have proper exit codes with && operators, with the disadvantage of multiple scans of the file if the first part matches)
  • Gert van den Berg
    Gert van den Berg about 6 years
    (The output also needs to be checked) (Those strings (and extensions) on the test set was mainly chosen that some files relatively quickly match all strings (to disadvantage things that scan entire files multiple times irrespective of matches) and have some files unlikely to match all strings (The .sh ones) - for performance when the whole file needs to be scanned) (I started with the entire source but ran out of patience waiting for my shell ones)
  • Gert van den Berg
    Gert van den Berg about 6 years
To do a search without regexes, you can use index (see your local awk man page) instead of =~. Having two variants is one option; another is to have a setting that chooses whether index or =~ is used.
  • Ed Morton
    Ed Morton about 6 years
I bet you don't get a performance improvement over the awk solution since awk is highly optimized for text processing, usually out-performing compiled languages like "C" for that task. Remember we're looking for strings btw, not regexps.
  • Ed Morton
    Ed Morton about 6 years
    Is that really looking for strings or is it looking for regexps? I mean if string1 was ".*" would it only match against the string ".*" in the input or would it match against any sequence of characters?
  • Ed Morton
    Ed Morton about 6 years
    @GertvandenBerg Right, you can use index() to do a string instead of regexp comparison and you can delete the matched strings from args instead of adding flags to temp to improve performance and then you end up with my answer
  • Gert van den Berg
    Gert van den Berg about 6 years
    @EdMorton: For non-regexes, a C version based on this on mmaped files runs my benchmark in around 1.5 seconds... (Using argv for search strings, not a file) (Extending the simple C case to regexes is not that easy though....)
  • Ed Morton
    Ed Morton about 6 years
Again, it's important to be searching for 1,000+ strings like the OP says is realistic. Searching for, say, 3 strings really isn't very useful wrt comparative timing.
  • binish
    binish about 6 years
It's technically regexps, but in this context it is looking for strings. If there are any metachars, you need to escape them appropriately.
  • Ed Morton
    Ed Morton about 6 years
    Doesn't perl have some option to do the escaping for you (like shells printf '%q') so your regexps kinda-sorta get string-like treatment? I thought I saw that somewhere. In any case, its very misleading to use the name "string" for your regexps. Seems to be a common theme in this thread though, idk why....
  • Gene
    Gene about 6 years
    @EdMorton Awk is a great tool, but it uses a regex interpreter. On 5,000 patterns, it will proceed by attempting to match each of the 5,000 patterns in sequence for each input character. Flex will compile all 5000 patterns into a single DFA that executes a few instructions per input character. That's why compiler scanners - where performance impacts every program ever compiled for the life of the scanner - are implemented with DFAs, not regex engines.
  • Ed Morton
    Ed Morton about 6 years
    We are not working with regexps in this question we are working with strings. Awks regexp engine is not being used and even if it were it would not be working with a single regexp and the algorithm used would reduce the number of regexps each time one is found.
  • Gene
    Gene about 6 years
    @EdMorton Actually the question says "we could have regexes in place of strings." Whether the patterns are simple strings or use alternation and Kleene star, my point is the same. For each character, an interpreter will - for each character - try each pattern in sequence until it finds a match. A DFA will do the equivalent of a C switch branch on each character and then move on to the next one.
  • Ed Morton
    Ed Morton about 6 years
    I took that as meaning the strings could contain regexp metacharacters but they should be treated as literal strings regardless. That's fine but that's not what a reasonable awk solution to the problem of finding multiple strings in a file will do, it'll simply loop through an array of those strings and remove each one that's found in the input file until the array is empty or the end of file is reached. If you think your approach will be faster then just try it and let us know the results.
  • codeforester
    codeforester about 6 years
    Is this optimal for large files?
  • codeforester
    codeforester about 6 years
    One of the downsides here is that it reads the entire file always and doesn't stop as soon as all strings match.
  • codeforester
    codeforester about 6 years
    Probably cmp is better than diff in this context.
  • kevinarpe
    kevinarpe over 2 years
    This is an excellent answer. Thank you to share! However, I found a strange issue. If <pathspec> has one or more absolute paths, git will crash with an error like: BUG: environment.c:202: git environment hasn't been setup However, relative paths do not cause this issue. I am using git grep to search multiple absolute paths with multiple logic-and regexp. Can this be done?