Efficiently search sorted file
Solution 1
There's an essential difference between grep and look: unless explicitly told otherwise, grep finds a pattern anywhere within a line, whereas look only matches at the start of a line. The look manpage states:
look - display lines beginning with a given string
I don't use look very often, but it worked fine on a trivial example I just tried.
Solution 2
This answer may be a little late, but sgrep may help you.
Sgrep (sorted grep) searches sorted input files for lines that match a search key and outputs the matching lines. When searching large files sgrep is much faster than traditional Unix grep, but with significant restrictions.
- All input files must be sorted regular files.
- The sort key must start at the beginning of the line.
- The search key matches only at the beginning of the line.
- No regular expression support.
You can download source here: https://sourceforge.net/projects/sgrep/?source=typ_redirect
and the documents here: http://sgrep.sourceforge.net/
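To make the restrictions above concrete, here is a minimal Python sketch of a prefix binary search over a sorted file. It is not sgrep's or look's actual implementation. The function names look_prefix and _line_at_or_after are hypothetical, and the sketch assumes the file is sorted in plain byte order (for example with LC_ALL=C sort) and that every line ends with a newline.

```python
import os


def _line_at_or_after(f, pos):
    """Return the first line that starts at or after byte offset pos (b"" at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # skip the rest of the line that contains byte pos - 1
    return f.readline()


def look_prefix(path, prefix):
    """Return all lines of a byte-sorted file that begin with prefix (bytes).

    Assumption: the file is sorted in byte order. With any other collation
    the binary search can silently miss matches.
    """
    matches = []
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line is >= prefix.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if line and line.rstrip(b"\n") < prefix:
                lo = mid + 1
            else:
                hi = mid
        # Matching lines are contiguous from here on, so scan forward.
        line = _line_at_or_after(f, lo)
        while line and line.startswith(prefix):
            matches.append(line.rstrip(b"\n"))
            line = f.readline()
    return matches
```

Only a logarithmic number of seeks is needed to find the first match, which is why the sort key and the search key both have to start at the beginning of the line.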
Another Way:
I don't know how large the file is. Maybe you should try parallel:
https://stackoverflow.com/questions/9066609/fastest-possible-grep
I regularly run grep on files larger than 100 GB, and it works well.
Matt
Updated on September 18, 2022
Comments
-
Matt over 1 year
I have a large file containing one string on each line. I would like to be able to quickly determine if a string is in the file. Ideally, this would be done using a binary chop type algorithm.
Some Googling revealed the look command with the -b flag, which promises to locate and output all strings beginning with a given prefix using a binary search algorithm. Unfortunately, it doesn't seem to work correctly and returns no results for strings that I know are in the file (they are properly returned by the equivalent grep search).
Does anyone know of another utility or strategy to search this file efficiently?
-
Sur3 over 6 years
The top answer states the wrong sorting. You have to sort with LC_COLLATE=C sort -d for the look command to function correctly, because look seems to ignore the locale and uses a hardcoded C-like ordering. I also opened a bug because of this confusing behaviour: bugzilla.kernel.org/show_bug.cgi?id=198011
-
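The effect described in this comment can be reproduced with a small Python sketch. The word list and the helper bytewise_contains are made up for illustration, and the case-insensitive sort key is only a rough stand-in for locale-aware collation, not what sort actually does in any particular locale.

```python
import bisect

words = ["apple", "Banana", "cherry", "Zebra"]

# Byte order, comparable to `LC_COLLATE=C sort`: uppercase sorts before lowercase.
c_order = sorted(w.encode() for w in words)

# Assumption: a case-insensitive key as a rough stand-in for a locale-aware sort.
locale_like = sorted((w.encode() for w in words), key=bytes.lower)


def bytewise_contains(sorted_lines, key):
    """Binary search that compares raw bytes, as a C-order search tool would."""
    i = bisect.bisect_left(sorted_lines, key)
    return i < len(sorted_lines) and sorted_lines[i] == key


print(bytewise_contains(c_order, b"Zebra"))      # True
print(bytewise_contains(locale_like, b"Zebra"))  # False: the entry exists but is missed
```

A byte-wise binary search over data sorted in a different order does not fail loudly. It just skips entries, which would be consistent with look returning far fewer lines than grep.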
Ernest.Huang about 6 years
look -b failed for me with the error File too large. I think it's trying to read the whole thing into memory.
-
-
Matt about 10 years
The file I need to search has around 110,000,000 lines. If I do
egrep "^TEST" sortedlist.txt | wc -l
I get 41,289 results. However, the equivalent look command,
look -b TEST sortedlist.txt | wc -l
yields only 1995 results. I almost wonder if there is a bug in look.
-
kasperd over 8 years
@Matt Maybe look is using different collation settings than the program you used to sort the file.
-
ingomueller.net almost 7 years
The sgrep in the Ubuntu repositories is actually this sgrep, which is designed to "search a file for a structured pattern" and has nothing to do with binary search.
-
muru almost 6 years
Isn't that already in askubuntu.com/a/701237/158442?
-
memorybox almost 6 years
Yes, I filled in the download link...
-
muru almost 6 years
If that's all, you should edit that post instead of posting a new answer.
-
memorybox almost 6 years
That post recommended sudo apt-get install sgrep to get sgrep. The sgrep in the Ubuntu repositories is not actually this sgrep; I'm not sure it's the same thing.