How to use grep efficiently?
Solution 1
If you have xargs installed and a multi-core processor, you can benefit from the following, in case anyone is interested.
Environment:
Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB
Tests:
1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8
real 3m24.358s
user 1m27.654s
sys 9m40.316s
2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings
real 16m3.051s
user 0m56.012s
sys 8m42.540s
3. Suggested by @Stephen: find the necessary files and terminate -exec with + instead of piping to xargs.
time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings
real 53m45.438s
user 0m5.829s
sys 0m40.778s
4. Regular recursive grep.
grep -R "string" >> Strings
real 235m12.823s
user 38m57.763s
sys 38m8.301s
For my purposes, the first command worked just fine.
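As a variation on the first command, a minimal sketch that sizes -P to the machine's actual core count rather than hard-coding 8 (this assumes GNU find/xargs and the nproc utility from coreutils; it is not part of the original answer):

```shell
# Sketch (assumptions: GNU find/xargs, nproc from coreutils):
# match the number of parallel grep workers to the core count,
# falling back to 8 if nproc is unavailable.
CORES=$(nproc 2>/dev/null || echo 8)
find . -name "*.ext" -print0 \
  | xargs -0 -n1 -P"$CORES" grep -H "string" >> Strings
```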
Solution 2
I'm wondering why -n1 is used below. Wouldn't it be faster to use a higher value (say, -n8, or to leave it out so xargs does the right thing)?
xargs -0 -n1 -P8 grep -H "string"
It seems more efficient to give each forked grep more than one file to process (I assume -n1 passes only one file name in argv to each grep). As I see it, we should be able to pass the highest n the system allows (based on the argc/argv maximum length limitation), so that the setup cost of spawning a new grep process is incurred less often.
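The batching idea above can be sketched as follows (a hedged example, not from the original answer; assumes GNU xargs, and the batch size of 64 is an arbitrary illustration):

```shell
# Sketch (assumed GNU xargs): hand each grep worker a batch of up to 64
# files, so the fork/exec cost of grep is amortized across many files,
# while still running 8 workers in parallel.
find . -name "*.ext" -print0 \
  | xargs -0 -n64 -P8 grep -H "string" >> Strings
```

Note that with more than one file per invocation grep prints file names anyway, but -H keeps the output format stable even when a batch happens to contain a single file.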
Comments
- Legend, over 3 years: I have a large number of small files to be searched. I have been looking for a good de-facto multi-threaded version of grep but could not find anything. How can I improve my usage of grep? As of now I am doing this: grep -R "string" >> Strings
- SiegeX, about 13 years: Might I suggest you use find's -print0 with xargs's -0 to delimit file names with the NUL character, so you don't run into trouble with file names containing spaces, newlines, or other garbage characters.
- Legend, about 13 years: @SiegeX: +1 Thank you for the pointers. I have edited my answer.
- armandino, about 13 years: +2 interesting answer. Cheers. A.
- ken, about 13 years: Nice! Would try to use this more often :)
- Stephen P, about 13 years: I'd like to see the results over the same file set of time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings_findExec (the \+ terminating the find does essentially the same as the find | xargs combo).
- SiegeX, about 13 years: @Stephen Not quite the same: xargs allows you to utilize multiprocessor capability with the -P flag, whereas the POSIX-2004-compliant versions of find that can be terminated with + act the same as if you were to pass -P1 to xargs, i.e. only one processor is utilized.
- Stephen P, about 13 years: @SiegeX That's what I was wondering: would the timings come out the same as the grep -R "string" >> Strings version?
- Legend, about 13 years: @Stephen: I updated my post with the new results. :) Not sure about the behavior though.
- fpmurphy, about 13 years: If you have a multi-core CPU, you could pipe the output of find to GNU parallel to do parallel grepping.
- Legend, about 13 years: @fpmurphy: Actually, parallel is not installed on the system where I am running my experiments, and I was used to xargs :) But thanks for the tip!
- Christian, over 11 years: @fpmurphy Isn't that exactly what xargs -P does? After I heard of this switch to xargs, I never really understood the purpose of GNU parallel.
- Christian, over 11 years: Leaving it out is not a good idea, since this will break if find returns a lot of files; and even if that doesn't occur, I doubt xargs has a very good way of knowing beforehand how many files to give to each instance of grep. The forking overhead will depend on the average file size, of course, and an unlucky combination of n small files for one grep and n very large ones for another could eat up even n=2 very easily. On my downloads dir, I picked a value of n=64 out of thin air and gained 3%.
- Thor, over 10 years: @Christian: here is a link to the GNU parallel documentation that compares xargs and parallel.
- Watt, over 10 years: Using this approach, I get an error with a large number of files: stackoverflow.com/questions/19694379/…
- Jérôme Pouiller, about 8 years: Why add -n1 to xargs? In my tests, it is 20 times faster without this option (or with a fair value like -n 1024).
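Several comments above mention GNU parallel as an alternative to xargs -P. A rough equivalent of the first benchmarked command might look like the following sketch (assuming GNU parallel is installed; the job count of 8 mirrors -P8):

```shell
# Hedged sketch: a GNU parallel counterpart to the xargs -P8 pipeline.
# -0 reads NUL-delimited names from find, and -j8 runs 8 jobs at a time;
# by default parallel hands each job a single file name, like xargs -n1.
find . -name "*.ext" -print0 \
  | parallel -0 -j8 grep -H "string" {} >> Strings
```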