How to use grep efficiently?

Solution 1

If you have xargs installed on a multi-core machine, you can benefit from the following, in case anyone is interested.

Environment:

Processor: dual quad-core, 2.4 GHz
Memory: 32 GB
Number of files: 584,450
Total size: ~35 GB

Tests:

1. Find the necessary files, pipe them to xargs, and tell it to run 8 grep instances in parallel.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8

real    3m24.358s
user    1m27.654s
sys     9m40.316s
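
For reference, a commented restatement of the flags used above (the behavior is identical to the command as written):

# -print0 / -0 : NUL-delimit file names, safe for spaces and newlines
# -n1          : pass one file name to each grep invocation
# -P8          : keep up to 8 grep processes running at once
# -H           : print the file name alongside each match
find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8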

2. Find the necessary files, pipe them to xargs, and tell it to run 4 grep instances in parallel.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings

real    16m3.051s
user    0m56.012s
sys     8m42.540s

3. Suggested by @Stephen: find the necessary files and use find's -exec with + instead of xargs.

time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings

real    53m45.438s
user    0m5.829s
sys     0m40.778s

4. Regular recursive grep (a single grep process scanning everything).

grep -R "string" >> Strings

real    235m12.823s
user    38m57.763s
sys     38m8.301s

For my purposes, the first command worked just fine.

Solution 2

I am wondering why -n1 is used below. Wouldn't it be faster to use a higher value (say, -n8), or to leave it out so xargs will do the right thing?

xargs -0 -n1 -P8 grep -H "string"

It seems more efficient to give each forked grep more than one file to process (I assume -n1 passes only one file name in argv to each grep). As I see it, we should be able to use the highest n the system allows (based on the argc/argv maximum length limit), so that the setup cost of spawning a new grep process is incurred less often.
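
As a rough way to test this, here is a minimal sketch (bash) that times the same pipeline with several batch sizes; the -n values are arbitrary picks, and "*.ext" and "string" are the same placeholders used above:

# Compare xargs batch sizes on the same file set; output is discarded
# so only the timing matters.
for n in 1 8 64 1024; do
    echo "batch size: $n"
    time find ./ -name "*.ext" -print0 | xargs -0 -n "$n" -P8 grep -H "string" > /dev/null
done

Note that grep exits non-zero for any batch that contains no match, so xargs may finish with exit status 123 even when nothing actually went wrong.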

Comments

  • Legend, over 3 years ago

    I have a large number of small files to search. I have been looking for a de facto multi-threaded version of grep but could not find anything. How can I improve my usage of grep? As of now I am doing this:

    grep -R "string" >> Strings
    
  • SiegeX, about 13 years ago
    Might I suggest you use find's -print0 with xargs's -0 to delimit file names with the NUL character, so you don't get into trouble with filenames that contain spaces, newlines, or other garbage characters.
  • Legend, about 13 years ago
    @SiegeX: +1 Thank you for the pointers. I have edited my answer.
  • armandino, about 13 years ago
    +2 interesting answer. Cheers. A.
  • ken, about 13 years ago
    Nice! I'll try to use this more often :)
  • Stephen P, about 13 years ago
    I'd like to see the results, over the same file set, of: time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings_findExec (the \+ terminating the find does essentially the same thing as the find | xargs combo)
  • SiegeX, about 13 years ago
    @Stephen Not quite the same: xargs lets you utilize multiprocessor capability with the -P flag, whereas POSIX-2004-compliant versions of find terminated with + act the same as if you had passed -P1 to xargs, i.e. they use only one processor.
  • Stephen P, about 13 years ago
    @SiegeX That's what I was wondering: would the timings come out the same as the grep -R "string" >> Strings version?
  • Legend, about 13 years ago
    @Stephen: I updated my post with the new results. :) Not sure about the behavior though.
  • fpmurphy, about 13 years ago
    If you have a multicore CPU, you could pipe the output of find to GNU parallel to do parallel grepping (see the sketch after these comments).
  • Legend, about 13 years ago
    @fpmurphy: Actually, parallel is not installed on the system where I am running my experiments, and I am used to xargs :) But thanks for the tip!
  • Christian, over 11 years ago
    @fpmurphy Isn't that exactly what xargs -P does? After I heard of this xargs switch, I never really understood the purpose of GNU parallel.
  • Christian, over 11 years ago
    Leaving it out is not a good idea, since this will break if find returns a lot of files; and even if that doesn't happen, I doubt xargs has a very good way of knowing beforehand how many files to give to each grep instance. The forking overhead depends on the average file size, of course, and an unlucky split (n small files for one grep and n very large ones for another) could easily eat up the gains, even for n=2. On my downloads directory I picked a value of n=64 out of thin air and gained 3%.
  • Thor, over 10 years ago
    @Christian: here is a link to GNU parallel documentation that compares xargs and parallel.
  • Watt, over 10 years ago
    Using this approach, I get an error with a large number of files: stackoverflow.com/questions/19694379/…
  • Jérôme Pouiller, about 8 years ago
    Why add -n1 to xargs? In my tests, it is 20 times faster without this option (or with a fair value like -n 1024).
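
For the GNU parallel route mentioned by fpmurphy, here is a minimal sketch, assuming GNU parallel is installed; the pattern, the glob, and the Strings_parallel output name are placeholders, as in the tests above:

# -0 reads NUL-delimited names, -j8 runs 8 jobs at once, and -X packs
# as many file names as fit into each grep invocation (like xargs
# without -n1).
find ./ -name "*.ext" -print0 | parallel -0 -j8 -X grep -H "string" >> Strings_parallel

Unlike xargs -P, parallel buffers each job's output by default, so matches from different files are not interleaved mid-line.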