Efficiently search sorted file

Solution 1

There's an essential difference between grep and look:

Unless told otherwise (for example with a ^ anchor), grep finds a pattern anywhere within a line. look only matches at the start of a line; its manpage states:

look — display lines beginning with a given string

I don't use look very often, but it worked fine on a trivial example I just tried.

Solution 2

This answer may be a little late:

Sgrep will help you.

Sgrep (sorted grep) searches sorted input files for lines that match a search key and outputs the matching lines. When searching large files sgrep is much faster than traditional Unix grep, but with significant restrictions.

  • All input files must be sorted regular files.
  • The sort key must start at the beginning of the line.
  • The search key matches only at the beginning of the line.
  • No regular expression support.
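For illustration, here is a minimal Python sketch of the kind of binary search that tools like look and sgrep are described as doing. It is not their actual implementation: the names lines_with_prefix and _line_at_or_after are made up for this example, and it assumes the file is sorted in plain byte order (as LC_ALL=C sort would produce) with the key at the start of each line.

```python
import os


def _line_at_or_after(f, pos):
    """Return the first whole line starting at byte offset >= pos (b"" at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume up to and including the previous newline
    return f.readline()


def lines_with_prefix(path, prefix):
    """Binary-search a byte-sorted file for lines beginning with `prefix` (bytes)."""
    n = len(prefix)
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line does not sort before prefix.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if line and line.rstrip(b"\n")[:n] < prefix:
                lo = mid + 1
            else:
                hi = mid
        # Collect the run of matching lines, which is contiguous in a sorted file.
        matches = []
        line = _line_at_or_after(f, lo)
        while line and line.rstrip(b"\n").startswith(prefix):
            matches.append(line.rstrip(b"\n"))
            line = f.readline()
        return matches
```

A call such as lines_with_prefix("sortedlist.txt", b"TEST") probes roughly log2(file size) positions instead of reading every line, which is where the speedup over grep comes from. It also shows why the restrictions matter: if the file is not sorted in the order the comparison expects, the search can skip over lines that are actually present.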

You can download source here: https://sourceforge.net/projects/sgrep/?source=typ_redirect

and the documents here: http://sgrep.sourceforge.net/

Another Way:

I don't know how large the file is. Maybe you should try parallel:

https://stackoverflow.com/questions/9066609/fastest-possible-grep

I regularly grep files larger than 100GB this way, and it works well.

Author: Matt

Updated on September 18, 2022

Comments

  • Matt
    Matt over 1 year

    I have a large file containing one string on each line. I would like to be able to quickly determine if a string is in the file. Ideally, this would be done using a binary chop type algorithm.

    Some Googling revealed the look command with the -b flag which promises to locate and output all strings beginning with a given prefix using a binary search algorithm. Unfortunately, it doesn't seem to work correctly and returns null results for strings that I know are in the file (they are properly returned by the equivalent grep search).

    Does anyone know of another utility or strategy to search this file efficiently?

    • Sur3
      Sur3 over 6 years
      The top answer gives the wrong sort order: for look to work correctly you have to sort with LC_COLLATE=C sort -d, because look seems to ignore the locale and hardcodes C-like ordering. I also opened a bug about this confusing behaviour: bugzilla.kernel.org/show_bug.cgi?id=198011
    • Ernest.Huang
      Ernest.Huang about 6 years
      look -b failed for me with an error File too large. I think it's trying to read the whole thing into memory.
  • Matt
    Matt about 10 years
    The file I need to search has around 110,000,000 lines. If I do egrep "^TEST" sortedlist.txt | wc -l I get 41,289 results. However the equivalent look commands, look -b TEST sortedlist.txt | wc -l yields only 1995 results. I almost wonder if there is a bug in look.
  • kasperd
    kasperd over 8 years
    @Matt Maybe look is using different collation settings than the program you used to sort the file.
  • ingomueller.net
    ingomueller.net almost 7 years
    The sgrep in the Ubuntu repositories is actually this sgrep, which is designed to "search a file for a structured pattern" and has nothing to do with binary search.
  • muru
    muru almost 6 years
    Isn't that already in askubuntu.com/a/701237/158442?
  • memorybox
    memorybox almost 6 years
    Yes, I filled in the download link...
  • muru
    muru almost 6 years
    If that's all, you should edit that post instead of posting a new answer.
  • memorybox
    memorybox almost 6 years
    That post recommended sudo apt-get install sgrep to get sgrep. The sgrep in the Ubuntu repositories is not actually this sgrep, so I'm not sure it's the same thing.
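
The collation mismatch suggested in the comments by kasperd and Sur3 can be illustrated with a small Python sketch. This is only an analogy, not a reproduction of what sort or look do: the word list is invented, and sorting with key=str.lower is a stand-in for a locale-aware sort, while bisect compares in plain code-point order, much like a C-locale comparison.

```python
import bisect

# Hypothetical sample data, not taken from the question's file.
words = ["Test", "apple", "TEST", "banana", "test"]

# Plain code-point order: uppercase letters sort before lowercase ones.
byte_order = sorted(words)

# Stand-in for a locale-aware sort that ignores case when ordering.
folded_order = sorted(words, key=str.lower)


def found(sorted_list, key):
    """Binary search that always compares in plain code-point order."""
    i = bisect.bisect_left(sorted_list, key)
    return i < len(sorted_list) and sorted_list[i] == key


print(found(byte_order, "TEST"))    # True: sort order matches the search order
print(found(folded_order, "TEST"))  # False: the entry exists but is missed
```

When the file is sorted under one ordering and searched under another, a binary search can land in the wrong region and silently return fewer results than grep, which is consistent with the symptom described in the question.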