Check if all of multiple strings or regexes exist in a file
Solution 1
Awk is the general-purpose text-manipulation tool that the same people who invented grep, shell, etc. created for jobs exactly like this, so it's not clear why you'd want to avoid it.
In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:
awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file
And here's a bunch of other information and options:
Assuming you're really looking for strings, it'd be:
awk -v strings='string1 string2 string3' '
BEGIN {
numStrings = split(strings,tmp)
for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
for (str in strs) {
if ( index($0,str) ) {
delete strs[str]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file
The above will stop reading the file as soon as all strings have been matched.
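For instance, a quick smoke test of that script might look like the following sketch (the file name demo.txt and the wrapper function check are made up for illustration):

```shell
# Illustrative wrapper around the awk script above; demo.txt is a made-up name.
printf 'foo here\nbar there\n' > demo.txt

check() {  # check "<space-separated strings>" <file>; exits 0 iff all are found
  awk -v strings="$1" '
    BEGIN { numStrings = split(strings, tmp); for (i in tmp) strs[tmp[i]] }
    numStrings == 0 { exit }
    { for (s in strs) if (index($0, s)) { delete strs[s]; numStrings-- } }
    END { exit (numStrings ? 1 : 0) }
  ' "$2"
}

check 'foo bar' demo.txt && echo "all strings found"
check 'foo baz' demo.txt || echo "some strings missing"
```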
If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:
awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file
Actually, even if it were strings you could do:
awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file
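As a hedged illustration of the whole-file index() check (assuming GNU awk is installed as gawk; the file name is invented), it can be exercised like this:

```shell
printf 'has string1\nand string2 too\n' > whole.txt

# RS='^$' is a GNU awk idiom for slurping the whole file into $0,
# so guard on gawk being available (illustrative check only).
if command -v gawk >/dev/null 2>&1; then
  gawk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2"))}' whole.txt \
    && echo "all strings present"
else
  echo "gawk not available; skipping"
fi
```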
The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.
I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:
awk '
NR==FNR { strings[$0]; next }
{
for (string in strings)
if ( !index($0,string) )
exit 1
}
' file_of_strings RS='^$' file_to_be_searched
and for regexps it'd be:
awk '
NR==FNR { regexps[$0]; next }
{
for (regexp in regexps)
if ( $0 !~ regexp )
exit 1
}
' file_of_regexps RS='^$' file_to_be_searched
If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending to a variable one line at a time as it's read and then processing that variable in the END section.
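The "append to a variable" alternative just mentioned might be sketched like this in portable awk (the file name doc.txt is illustrative):

```shell
printf 'alpha line\nbeta line\n' > doc.txt

# Accumulate every line into buf, then test the whole text in the END section.
awk '
  { buf = buf $0 ORS }
  END { exit !(index(buf, "alpha") && index(buf, "beta")) }
' doc.txt && echo "both strings present"
```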
If your file_to_be_searched is too large to fit in memory then it'd be this for strings:
awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
for (string in strings) {
if ( index($0,string) ) {
delete strings[string]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched
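A possible end-to-end run of that memory-friendly strings version (needles.txt and haystack.txt are invented names):

```shell
printf 'one\ntwo\n' > needles.txt
printf 'line with one\nanother with two\n' > haystack.txt

# First file loads the search strings; second file is scanned line by line.
if awk '
  NR==FNR { strings[$0]; numStrings=NR; next }
  numStrings == 0 { exit }
  { for (s in strings) if (index($0, s)) { delete strings[s]; numStrings-- } }
  END { exit (numStrings ? 1 : 0) }
' needles.txt haystack.txt
then
  echo "all strings found"
fi
```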
and the equivalent for regexps:
awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
for (regexp in regexps) {
if ( $0 ~ regexp ) {
delete regexps[regexp]
numRegexps--
}
}
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
Solution 2
git grep
Here is the syntax using git grep
with multiple patterns:
git grep --all-match --no-index -l -e string1 -e string2 -e string3 file
You may also combine patterns with Boolean expressions such as --and, --or and --not. Check man git-grep for help.
--all-match
When giving multiple pattern expressions, this flag limits the match to files that have lines matching all of them.
--no-index
Search files in the current directory that are not managed by Git.
-l / --files-with-matches / --name-only
Show only the names of matching files.
-e
The next parameter is the pattern. The default is to use basic regexp.
Other params to consider:
--threads
Number of grep worker threads to use.
-q / --quiet / --silent
Do not output matched lines; exit with status 0 when there is a match.
To change the pattern type, you may also use -G / --basic-regexp (the default), -F / --fixed-strings, -E / --extended-regexp, -P / --perl-regexp, -f file, and others.
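A hedged, self-contained illustration of the command (skipped when git isn't installed; sample.txt is a made-up file name):

```shell
# Work in a fresh temp dir so --no-index runs outside any repository.
tmpdir=$(mktemp -d) && cd "$tmpdir"
printf 'has string1\nhas string2\nhas string3\n' > sample.txt

if command -v git >/dev/null 2>&1; then
  # Prints "sample.txt" and exits 0 only if all three patterns match somewhere.
  git grep --all-match --no-index -l -e string1 -e string2 -e string3 sample.txt \
    && echo "all patterns matched"
else
  echo "git not installed; skipping"
fi
```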
Solution 3
This gnu-awk
script may work:
cat fileSearch.awk
re == "" {
exit
}
{
split($0, null, "\\<(" re "\\>)", b)
for (i=1; i<=length(b); i++)
gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
exit (re != "")
}
Then use it as:
if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
echo "all strings were found"
else
echo "not all strings were found"
fi
Alternatively, you can use this gnu grep
solution with PCRE
option:
grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file
- Using -z we make grep read the complete file as a single string.
- We use multiple lookahead assertions to assert that all the strings are present in the file.
- The regex must use (?s) (the DOTALL modifier) to make .* match across lines.
As per man grep
:
-z, --null-data
Treat input and output data as sequences of lines, each terminated by a
zero byte (the ASCII NUL character) instead of a newline.
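To see the lookahead approach in action, here is a guarded sketch (-P requires GNU grep built with PCRE support, so it probes for that first; the file name pcre-demo.txt is invented):

```shell
printf 'aaa string1 bbb\nccc string2 ddd\n' > pcre-demo.txt

# -P (PCRE) and -z are GNU grep extensions; probe for -P support first.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)' pcre-demo.txt \
    && echo "all strings present"
else
  echo "grep -P unavailable; skipping"
fi
```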
Solution 4
First, you probably want to use awk
. Since you eliminated that option in the question statement, yes, it is possible to do and this provides a way to do it. It is likely MUCH slower than using awk
, but if you want to do it anyway...
This is based on the following assumptions:
- Invoking awk is unacceptable
- Invoking grep multiple times is unacceptable
- The use of any other external tools is unacceptable
- Invoking grep less than once is acceptable
- It must return success if everything is found, failure when not
- Using bash instead of external tools is acceptable
- The bash version must be >= 3 for the regular expression version
This might meet all of your requirements (the regex version has fewer comments; look at the string version instead):
#!/bin/bash
multimatch() {
filename="$1" # Filename is first parameter
shift # move it out of the way so that "$@" contains only the search strings
strings=( "$@" ) # search strings into an array
declare -a matches # Array to keep track which strings already match
# Initiate array tracking what we have matches for
for ((i=0;i<${#strings[@]};i++)); do
matches[$i]=0
done
while IFS= read -r line; do # Read file linewise
foundmatch=0 # Flag to indicate whether this line matched anything
for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
string="${strings[$i]}" # fetch the string
if [[ $line = *$string* ]]; then # check if it matches
matches[$i]=1 # mark that we have found this
foundmatch=1 # set the flag, we need to check whether we have something left
fi
fi
done
# If we found something, we need to check whether we
# can stop looking
if [ "$foundmatch" -eq 1 ]; then
somethingleft=0 # Flag to see if we still have unmatched strings
for ((i=0;i<${#matches[@]};i++)); do
if [ "${matches[$i]}" -eq 0 ]; then
somethingleft=1 # Something is still outstanding
break # no need check whether more strings are outstanding
fi
done
# If we didn't find anything unmatched, we have everything
if [ "$somethingleft" -eq 0 ]; then return 0; fi
fi
done < "$filename"
# If we get here, we didn't have everything in the file
return 1
}
multimatch_regex() {
filename="$1" # Filename is first parameter
shift # move it out of the way so that "$@" contains only the regexes
regexes=( "$@" ) # Regexes into an array
declare -a matches # Array to keep track which regexes already match
# Initiate array tracking what we have matches for
for ((i=0;i<${#regexes[@]};i++)); do
matches[$i]=0
done
while IFS= read -r line; do # Read file linewise
foundmatch=0 # Flag to indicate whether this line matched anything
for ((i=0;i<${#regexes[@]};i++)); do # Loop through regex indexes
if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this regex yet
regex="${regexes[$i]}" # Get regex from array
if [[ $line =~ $regex ]]; then # We use the bash regex operator here
matches[$i]=1 # mark that we have found this
foundmatch=1 # set the flag, we need to check whether we have something left
fi
fi
done
# If we found something, we need to check whether we
# can stop looking
if [ "$foundmatch" -eq 1 ]; then
somethingleft=0 # Flag to see if we still have unmatched strings
for ((i=0;i<${#matches[@]};i++)); do
if [ "${matches[$i]}" -eq 0 ]; then
somethingleft=1 # Something is still outstanding
break # no need check whether more strings are outstanding
fi
done
# If we didn't find anything unmatched, we have everything
if [ "$somethingleft" -eq 0 ]; then return 0; fi
fi
done < "$filename"
# If we get here, we didn't have everything in the file
return 1
}
if multimatch "filename" string1 string2 string3; then
echo "file has all strings"
else
echo "file is missing one or more strings"
fi
if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
echo "file matches all regular expressions"
else
echo "file does not match all regular expressions"
fi
Benchmarks
I did some benchmarking, searching .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added and the code tuned so that each can be called as testname <filename> <searchstring> [...] and so that an if can be used to check the result.)
Results: (measured with time, real time rounded to the closest half second)

- multimatch: 49s
- multimatch_regex: 55s
- matchall: 10.5s
- fileMatchesAllNames: 4s
- awk (first version): 4s
- agrep: 4.5s
- Perl re (-r): 10.5s
- Perl non-re: 9.5s
- Perl non-re optimised: 5s (removed Getopt::Std and regex support for faster startup)
- Perl re optimised: 7s (removed Getopt::Std and non-regex support for faster startup)
- git grep: 3.5s
- C version (no regex): 1.5s
(Invoking grep multiple times, especially with the recursive method, did better than I expected.)
Solution 5
A recursive solution. Iterate over the files one by one. For each file, check if it matches the first pattern and break early (-m1: on first match), only if it matched the first pattern, search for second pattern and so on:
#!/bin/bash
patterns="$@"
fileMatchesAllNames () {
file=$1
if [[ $# -eq 1 ]]
then
echo "$file"
else
shift
pattern=$1
shift
grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
fi
}
for file in *
do
test -f "$file" && fileMatchesAllNames "$file" $patterns
done
Usage:
./allfilter.sh cat filter java
test.sh
Searches in the current dir for the tokens "cat", "filter" and "java". Found them only in "test.sh".
So grep is invoked often in the worst-case scenario (finding the first N-1 patterns in the last line of each file, except for the N-th pattern).
But with an informed ordering (rare matches first, early matches first), where possible, the solution should be reasonably fast, since many files are abandoned early because they didn't match the first keyword, or accepted early because they matched a keyword close to the top.
Example: you search a Scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position); you would order them:
./allfilter.sh mutable tailrec main println
Performance:
ls *.scala | wc
89 89 2030
In 89 scala files, I have the keywords distribution:
for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done
16
34
41
71
Searching them with a slightly modified version of the script, which allows using a filepattern as the first argument, takes about 0.2s:
time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala
real 0m0.216s
user 0m0.024s
sys 0m0.028s
in close to 15,000 lines of code:
cat *.scala | wc
14913 81614 610893
Update:
After reading in the comments to the question that we might be talking about thousands of patterns, handing them over as arguments doesn't seem to be a clever idea; better to read them from a file and pass the filename as an argument - maybe for the list of files to filter, too:
#!/bin/bash
filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"
fileMatchesAllNames () {
file=$1
if [[ $# -eq 1 ]]
then
echo "$file"
else
shift
pattern=$1
shift
grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
fi
}
echo -e "Filelist: $filelist\tPatterns: $patterns"
for file in $(< $filelist)
do
test -f "$file" && fileMatchesAllNames "$file" $patterns
done
If the number and length of patterns/files exceeds the possibilities of argument passing, the list of patterns could be split into many patternfiles and processed in a loop (for example of 20 pattern files):
for i in {1..20}
do
./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done
Updated on June 16, 2022
Comments
-
codeforester about 2 years
I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:
... string1 ... string2 ... string3 ... string1 string2 ... string1 string2 string3 ... string3 string1 string2 ... string2 string3 ... and so on
In the above example, we could have regexes in place of strings.
For example, the following code checks if any of my strings exists in the file:
if grep -EFq "string1|string2|string3" file; then
    # there is at least one match
fi
How to check if all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.
Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or if we have a large number of strings to match) or use a tool like awk or python? Also, is there a solution for strings that can easily be extended for regexes?
-
Ian McGowan about 6 yearsDuplicate question: unix.stackexchange.com/questions/55359/…
-
codeforester about 6 years@IanMcGowan: that's not a dupe. That post talks about the all patterns occurring together on the same line. My interest is to find them on the same line or on different lines.
-
Ian McGowan about 6 yearsAh, good point, my bad. I should have added a question mark.
-
user unknown about 6 yearsDo you have some typical, practical numbers in mind like numbers of patterns (4, 16, 64, ...) and files to search (10, 1000, 100000)?
-
codeforester about 6 years@userunknown: up to a few thousand patterns is reasonable.
-
user unknown about 6 years @codeforester: Wow, a few thousand - that seems like important information. For such big numbers, it might get problematic to pass them as arguments via the shell. Maybe it is possible to split them into chunks of some hundreds of parameters, filter a filelist and get a filtered filelist from the first invocation, to produce a reduced filelist as input for the second invocation with the next chunk, and so on. Instead of passing the files and patterns as parameters, one could pass filenames containing these; with few modifications this would be possible with my script, but surely with other scripts too.
-
glerYbo about 6 years
-
Ed Morton about 6 years I'm completely aghast at how many times I see the word "pattern" in this Q&A and all of the answers that search for regexps using the word "string" instead. Saying you want to find a "pattern" is like losing your dog and putting up flyers that just say "lost: 1 animal", and saying you're working with a string when it's really a regexp is like handing someone a cat and trying to convince them it's really their missing dog!
-
Gert van den Berg about 6 years Why does this have an awk and python tag if it asks about doing it without them?
-
-
learningbee about 6 years This would work; however, OP says he doesn't want to call grep multiple times.
-
Ian McGowan about 6 yearsAcknowledged, but I don't think there's another way to do it with plain grep, and at least the -q and && will shortcut some of the multiple passes over the data files.
-
anubhava about 6 years No, the grep solution works irrespective of any order of appearance of those strings in the file. Lookahead assertions just make sure those strings are present anywhere in the file. So grep can also be written as: grep -qzP '(?s)(?=.*\bstring3\b)(?=.*\bstring1\b)(?=.*\bstring2\b)' file
-
anubhava about 6 years Indeed, your gnu-awk solution is neat
-
codeforester about 6 yearsThis is a good answer. Started a bounty to attract even more attention.
-
Ed Morton about 6 yearsOk but you really should clarify if you want a solution for strings or regexps and whether you need "full word" matches only or partial matches.
-
Leon about 6 years uniq eliminates only adjacent duplicates. That's why I used sort -u in my answer.
-
Leon about 6 yearsOP is looking for a solution that doesn't require invoking grep multiple times (per file).
-
user unknown about 6 years @Leon: That's the question - whether it is possible. And at least for files which don't match an earlier pattern, grep is not invoked any longer, and for each match it stops early, so with an informed ordering (rare matches first, early matches first), where possible, the solution should be reasonably fast.
-
Anna about 6 years@Leon yes, absolutely makes sense. Thank you for correcting me!
-
Ed Morton about 6 yearsGood. Really think long and hard about what each solution is doing as your posted sample input doesn't cover all of the cases it needs to to prove if a solution works or not so while all solutions will produce the expected output given that sample input, most of them will fail given different input. So either take your time to figure out different sample input that covers all of your requirements (in particular files that should not match but are hard to handle) or really think through the logic of each solution to make sure it does address all of your needs.
-
Ed Morton about 6 yearsOh, and also make sure you understand why-is-using-a-shell-loop-to-process-text-considered-bad-practice if you're considering any solution that involves a shell loop, and consider what the given solution will do wrt memory and speed of execution for a huge file.
-
Gert van den Berg about 6 yearsSome benchmarking (say on the scala file example) would be interesting... It is likely significantly slower than AWK - it is mainly an exercise to show that the stated requirements can be met... (Speed was not mentioned as a requirement) (This also seems to use no external processes - that can be good for speed, but bash text processing, which can be bad for speed, at least compared to C code in grep...)
-
kvantour about 6 years@EdMorton Very nice answer! I am missing only one thing, what if the searched string is split by a newline. Imagine a textbook and the string you want to search for is broken apart by a newline. It is a bit more difficult but it could be a nice addition to this answer!
-
user unknown about 6 years This reports multiple matches per file. But with time git grep --name-only --all-match -e ... it only reports the matching files, and it is impressively fast - my solution took about 0.2s for my testcase, yours took 0.02s. And the option -f patternfile might be interesting too. Most important maybe: it is simple, clear and compact, and will surely gain more support over time than my homegrown script.
Ed Morton about 6 years Thanks, but there's nothing special about a newline, you just treat it like any other character. If the OP wants to search for a string that contains a newline then they just have to set RS to whatever the record separator is instead of a newline (e.g. \0 or ^$ as I show above to search the whole file, or maybe "" if blank lines separate the records, or [[:punct:]] or something else) and include a newline in the string that's being searched for, just like any other character.
-
Ed Morton about 6 yearsThat script will not look for any strings, it will look for 4 regexps. If you want it to act as if it were looking for strings then see stackoverflow.com/q/29613304/1745001 for how to do that.
-
Ed Morton about 6 years Yes, you're right, it would be several orders of magnitude slower than an awk script. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice. You use the word pattern in several places in your script - that's a highly ambiguous term which as such should generally be avoided, so please clarify each time you use it whether you are talking about a string, a regexp, a globbing pattern or something else.
-
Gert van den Berg about 6 years @EdMorton: It seems to be the only way to avoid standard tools (like AWK) (or non-standard tools, like Perl) (I'm working on the assumption that the only acceptable non-shell tool is grep, invoked only once...) (The question asks whether it is "possible", not whether it is a good idea ;-) ) "Pattern" was used for the search string (which can either be a regex or a plain string matched with globbing, depending on the variant) (The regex version was done once I realised that bash had built-in regexes, and is a simple modification of the first version) (detailed comments were added after the fork)
-
Gert van den Berg about 6 years @EdMorton: A lot (but not all) of the criticism of the read method is about invoking external tools multiple times, which this avoids. (It does need a blank IFS to keep it from stripping leading whitespace (I know that with a single variable, whitespace between terms is not affected by $IFS))
-
Gert van den Berg about 6 years(assuming "use" should be "using" in the quoted part not "to instead use")
-
kvantour about 6 yearsAnd this is why SO is so great. Everybody has different ways of approaching a problem and from time to time something unexpected pops up.
-
codeforester about 6 years Instead of using awk to create the expression for grep, we could definitely use grep -f, which would work much better when we have a lot of strings to search for. That way, it would be easier to handle any special characters in the strings.
-
Ed Morton about 6 years Ignoring the UUOCs, this searches for regexps, not strings. You're using awk to create a regexp like regexp1|regexp2|... that gets saved in the badly-named variable STR, and then you're using grep -o to find it. The grep -o MUST do a regexp match since you're relying on the |s in STR, so every part of STR is treated as a regexp and there's nothing you can do to fix that. It will also fail in several ways, e.g. when one "string" is part of another; try $ echo 'footherebar' | grep -E -o 'the|there' and you'll find the output is just there, not the and there as required.
-
Ed Morton about 6 yearsThe OP tells us she has thousands of strings to search for and so hard-coding them all as in your first script won't be practical (plus it's searching for regexps, not strings). Your second script is similar to mine in terms of using a counter but instead of reducing the array of strings on every loop like mine does, yours is creating a duplicate array of strings and so will be slower than mine and use twice as much memory. It wasn't me who downvoted btw.
-
revo about 6 yearsWhere am I creating duplicate array of strings? and I don't get you by * it's searching for regexps, not strings* since whatever the input is, a pattern or a literal string, it doesn't confuse. @EdMorton
-
Ed Morton about 6 years In your first script /string/ is defining a regexp literal with the contents string and then doing a regexp comparison of that against $0, i.e. it's searching for a regexp, not a string. In your second script you have an array args[] that contains all of your strings, and each time you match a string in the input you're adding it to the array temp[], so when all strings are found in the input you end up with temp[] being a duplicate of args[].
-
Ed Morton about 6 years Oh, hang on - I misread your second script. I assumed you were using args[] to contain the arg strings as indices so you could do a simple loop on the indices and a string comparison like I'm doing, but you aren't: you're storing the strings as array contents, looping on numeric indices, and then dereferencing the array each time and doing a regexp comparison! So arg in your second script isn't actually an arg (one of the strings passed in), it's an index, and the arg/string is actually args[arg], so you aren't creating a dup in temp[], but there are other issues instead.
-
revo about 6 yearsI guessed you did probably on if condition... but let's talk about strings vs regexps. I see it as one single solution for both. You provided two individually and I'm not sure how it acts inefficiently on strings vs regexps. Maybe a benchmark needed? @EdMorton
-
Ed Morton about 6 years Try both solutions when the string being searched for is .*. Mine will find the exact string .* (i.e. a period followed by an asterisk) if and only if it exists in the input file, while yours will match on the first line no matter what the input file contents, since .* will be treated as the regexp "zero or more repetitions of any character". wrt efficiency, I'm not considering the efficiency of regexp vs string comparisons; I'm removing each string from my array when it matches, so the array gets smaller and faster to loop through on every match. Feel free to do timing tests if you like.
-
revo about 6 years Isn't it dependent on OP? In case of .* you are right, but if OP doesn't have quantifiers or regex metacharacters in his/her arguments then what would be the issue? And since a line is read at a time, even .* (which prints the wrong substring) doesn't differ, performance-wise, from matching a literal string like s. Even if the whole input string is one single line by means of \n being the RS. @EdMorton
-
Ed Morton about 6 yearsRight, if the OP chooses not to have any regexp metacharacters then there are no false matches produced by regexp metacharacters but why impose that restriction when you can just as easily do string comparisons? The OP said she wants to find strings so why then say 'ok but you can't simply look for "foo.c" in a list of file names', etc.. Again, when I'm talking about performance I'm not talking about the difference in a string vs regexp comparison, I'm talking about reducing the number of values being looped on for every time a value is found vs keeping the number of iterations constant.
-
Gert van den Berg about 6 years The new version can probably deal with plain strings (by adding a -F to the grep) (and without -F, or with -G, standard regexes) (and can deal with extended regexes with -E, or other non-standard variants, like -P for PCRE regexes in GNU grep)
-
Ed Morton about 6 years It still can't handle strings being contained in other strings though. Again, try any variation of echo 'footherebar' | grep -E -o 'the|there' and it'll find there but not the.
-
Ed Morton about 6 years Best I can tell git grep -q --all-match --no-index -l -F -f strings.txt file seems like it'd work just fine as long as you don't mind installing git tools. Does it read the whole file into memory at one time or does it store 1 line at a time in memory? Is there any limit on how many strings can be searched for at one time?
-
glerYbo about 6 years @EdMorton There are no string limits, it'll just take longer. In terms of memory management, I'm not sure; you can check the source code of grep.c (look for all_match). But overall, it's designed to be quick.
-
Ed Morton about 6 yearsThanks for doing the benchmarking. Since the OP is searching for thousands of strings could you try again with a large number of strings (e.g. at least 1,000, all of which appear in the target file,and some of which are subsets of each other and some of which contain regexp metachars)? There's huge differences between how the various solutions will perform as the number of strings to be searched for gets large (and match) plus some solutions will fail given strings that are substrings of other strings or strings that contain RE chars and those differences won't show up for just those 3 strings
-
Ed Morton about 6 yearsThanks, best I can tell from reading that source file it always reads the whole input file into memory.
-
Gautam about 6 yearsYes, Strangely, if the contents of my strings.txt is "there\nthe" , it matches both the strings, whereas if it is "the\nthere" it matches only one. :P . I have modified my answer to say it isn't able to though.
-
Gert van den Berg about 6 years @EdMorton: It becomes tricky - many of the solutions have different interfaces that become tricky (and possibly slow) to map. The CLI ones might need a non-shell method to invoke them to build an argv close to ARG_MAX (although they can be arbitrarily combined, if they have proper exit codes, with && operators, with the disadvantage of multiple scans of the file if the first part matches)
-
Gert van den Berg about 6 years (The output also needs to be checked) (Those strings (and extensions) on the test set were mainly chosen so that some files relatively quickly match all strings (to disadvantage things that scan entire files multiple times irrespective of matches) and some files are unlikely to match all strings (the .sh ones) - for performance when the whole file needs to be scanned) (I started with the entire source but ran out of patience waiting for my shell ones)
-
Gert van den Berg about 6 years To do a search without regexes, you can use index (see your local awk man page) instead of =~. Two variants is one option; another is to have a setting that chooses whether index or =~ is used.
-
Ed Morton about 6 years I bet you don't get a performance improvement over the awk solution, since awk is highly optimized for text processing, usually out-performing compiled languages like "C" for that task. Remember we're looking for strings btw, not regexps.
-
Ed Morton about 6 years Is that really looking for strings or is it looking for regexps? I mean, if string1 was ".*" would it only match against the string ".*" in the input or would it match against any sequence of characters?
-
Ed Morton about 6 years @GertvandenBerg Right, you can use index() to do a string instead of regexp comparison, and you can delete the matched strings from args instead of adding flags to temp to improve performance, and then you end up with my answer
-
Gert van den Berg about 6 years @EdMorton: For non-regexes, a C version based on this on mmaped files runs my benchmark in around 1.5 seconds... (using argv for search strings, not a file) (extending the simple C case to regexes is not that easy though....)
-
Ed Morton about 6 years Again, it's important to be searching for 1,000+ strings like the OP says is realistic. Searching for, say, 3 strings really isn't very useful wrt comparative timing.
-
binish about 6 years It's technically regexps, but in this context it is looking for strings. If there are any metachars, you need to escape them appropriately.
-
Ed Morton about 6 years Doesn't perl have some option to do the escaping for you (like shell's printf '%q') so your regexps kinda-sorta get string-like treatment? I thought I saw that somewhere. In any case, it's very misleading to use the name "string" for your regexps. Seems to be a common theme in this thread though, idk why....
-
Gene about 6 years@EdMorton Awk is a great tool, but it uses a regex interpreter. On 5,000 patterns, it will proceed by attempting to match each of the 5,000 patterns in sequence for each input character. Flex will compile all 5000 patterns into a single DFA that executes a few instructions per input character. That's why compiler scanners - where performance impacts every program ever compiled for the life of the scanner - are implemented with DFAs, not regex engines.
-
Ed Morton about 6 yearsWe are not working with regexps in this question we are working with strings. Awks regexp engine is not being used and even if it were it would not be working with a single regexp and the algorithm used would reduce the number of regexps each time one is found.
-
Gene about 6 years @EdMorton Actually the question says "we could have regexes in place of strings." Whether the patterns are simple strings or use alternation and Kleene star, my point is the same. For each character, an interpreter will try each pattern in sequence until it finds a match. A DFA will do the equivalent of a C switch branch on each character and then move on to the next one.
-
Ed Morton about 6 yearsI took that as meaning the strings could contain regexp metacharacters but they should be treated as literal strings regardless. That's fine but that's not what a reasonable awk solution to the problem of finding multiple strings in a file will do, it'll simply loop through an array of those strings and remove each one that's found in the input file until the array is empty or the end of file is reached. If you think your approach will be faster then just try it and let us know the results.
-
codeforester about 6 yearsIs this optimal for large files?
-
codeforester about 6 yearsOne of the downsides here is that it reads the entire file always and doesn't stop as soon as all strings match.
-
codeforester about 6 years Probably cmp is better than diff in this context.
-
kevinarpe over 2 years This is an excellent answer. Thank you for sharing! However, I found a strange issue. If <pathspec> has one or more absolute paths, git will crash with an error like: BUG: environment.c:202: git environment hasn't been setup. However, relative paths do not cause this issue. I am using git grep to search multiple absolute paths with multiple logic-and regexps. Can this be done?