Anything faster than grep?


grep is I/O bound, meaning its speed is dominated by how fast it can read the files it is searching. Multiple searches in parallel can compete with each other for disk I/O, so you may not see much speedup.

If you just need the matching filenames, and not the actual matching lines, run grep with the -l flag. It prints only the names of files that contain a match, and it lets grep stop searching a file as soon as it finds the first match, which can reduce the amount of work grep has to do.
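
For example, a minimal illustration (the filenames here are just placeholders):

# print only the names of files that contain the string, not the matching lines
grep -l 'some text' file1.txt file2.txt file3.txt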

If you're searching for fixed strings rather than regular expressions, then you could try using fgrep (or grep -F, which is equivalent) rather than grep. fgrep is a variant of grep that searches for fixed strings, and a fixed-string search is faster than a full regular-expression search. You may or may not see any improvement from this, because modern versions of grep are probably smart enough to optimize fixed-string searches anyway.
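
For instance, these two commands ask for the same fixed-string search (the pattern and filename are placeholders):

# -F tells grep to treat the pattern as a literal string, not a regex
grep -F -l 'exact match' somefile.txt
# fgrep is the traditional name for the same behaviour
fgrep -l 'exact match' somefile.txt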

If you want to try running multiple searches in parallel, you could do it using shell utilities. One way would be to build a list of filenames, split it into parts, and run grep separately for each list:

find /path/to/files -type f -print | split -l 10000000 - list.
for file in list.*; do
    xargs grep -l 'some text' < "${file}" > "${file}.out" &
done
wait
cat list.*.out > filepaths.log
rm list.*

This uses find to list the files, splits the list of filenames into groups of ten million, and starts a background job for each group that feeds its filenames to grep via xargs. The outputs of all the greps are joined together at the end. This ought to work for files with typical names, but it would fail for filenames containing spaces, quotes, or newlines, for example.

Another approach uses xargs. First, you'd have to write a simple shell script that runs grep in the background:

#!/bin/bash
grep -l 'search text' "$@" >> grep.$$.out &

This runs grep on the list of files passed as arguments to the script, writing the results to a file named after the script's PID. Because grep is started in the background, the script returns immediately, so xargs can go on to launch the next batch while earlier greps are still running.

Then you'd run the script like this:

find /path/to/files -type f -print0 | xargs -0 -r /my/grep/script
[ wait for those to finish ]
cat grep.*.out > filepaths.log
rm grep.*.out

In this case, xargs bundles the filenames into groups and runs the script once for each group, and each run of the script launches one grep for its group. Once all of the grep instances have finished, you can combine their output. Unfortunately, I couldn't think of a clever way to automatically wait for the background greps here (they aren't children of your interactive shell), so you might have to do that manually.
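
If your xargs is the GNU version, another option is to let xargs manage the parallelism itself with its -P flag, which sidesteps the waiting problem entirely. A rough sketch, with the batch size and process count as guesses you'd want to tune:

# run up to 4 greps at a time, each handed 1000 filenames;
# all matching filenames land in one output file, so there is nothing to merge
find /path/to/files -type f -print0 |
    xargs -0 -r -n 1000 -P 4 grep -l 'search text' > filepaths.log

If you're worried about the parallel greps interleaving their output, you can still have each batch write to its own file, as in the script above, and concatenate the files afterwards.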

Comments

  • Jeffrey L. Roberts
    Jeffrey L. Roberts over 1 year

    I am looking for a tool that will be faster than grep, maybe a multi-threaded grep, or something similar... I have been looking at a bunch of indexers, but I am not sold that I need an index...

    I have about 100 million text files that I need to grep for exact string matches; upon finding a match, I need the filename where the match was found.

    ie: grep -r 'exact match' > filepaths.log
    

    It's about 4 TB of data, and I started my first search 6 days ago, and grep is still running. I have another dozen searches to go, and I can't wait 2 months to retrieve all these filenames =]

    I've reviewed the following; however, I don't think I need all the bells and whistles these indexers come with. I just need the filename where the match occurred...

    • dtSearch
    • Terrier
    • Lucene
    • Xapian
    • Recoll
    • Sphinx

    After spending hours reading about all those engines, my head is spinning, and I wish I just had a multi-threaded grep lol. Any ideas and/or suggestions are greatly appreciated!

    PS: I am running CentOS 6.5

    EDIT: Searching for multi-threaded grep returns several results. My question is: is a multi-threaded grep the best option for what I am doing?

    EDIT2: After some tweaking, this is what I have come up with, and it is going much faster than regular grep, though I still wish it were faster... I am watching my disk I/O wait, and it's not building up yet, so I may do some more tweaking. Def still interested in any suggestions =]

    find . -type f -print0 | xargs -0 -n10 -P4 grep -m 1 -H -l 'search string'
    
    • Daniel B
      Daniel B over 9 years
      So you actually plan to search for more than one string, right? Multithreading won’t help because you’re limited by disk throughput and (more importantly) seek performance.
    • Jeffrey L. Roberts
      Jeffrey L. Roberts over 9 years
      Ya, the disks are def the bottleneck here
    • Tyson
      Tyson over 9 years
      "100 million text files"... really? and the approach to take really all depends on whether this is a one time thing or whether the data really needs to be indexed for future use.
    • Jeffrey L. Roberts
      Jeffrey L. Roberts over 9 years
      Ya... really, lol =] It's more or less a one-time thing, for about 2 dozen searches in total
  • Ričardas Sabaliauskas
    Ričardas Sabaliauskas almost 8 years
    If you don't need regex, one benefit of fgrep is you don't have to worry about escaping reserved characters, e.g. fgrep '..' instead of grep '\.\.'.
  • Christophe Roussy
    Christophe Roussy about 6 years
      If the files are on several drives, maybe, but this is I/O bound, not CPU bound.
  • iAdjunct
    iAdjunct about 5 years
    Grep isn't always I/O bound. I'm currently running a CPU-bound grep.
  • user2284570
    user2284570 about 3 years
      @Kenster In the era of Gb/s SSDs, not being able to process a single 30 TiB file using multiple threads is really annoying.