Most efficient way for a lookup/search in a huge list (python)
Solution 1
Don't create a list
, create a set
. It does lookups in constant time.
If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect
module.
from bisect import bisect_left
def bi_contains(lst, item):
""" efficient `item in lst` for sorted lists """
# if item is larger than the last its not in the list, but the bisect would
# find `len(lst)` as the index to insert, so check that first. Else, if the
# item is in the list then it has to be at index bisect_left(lst, item)
return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
Solution 2
A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.
Obviously adding new words to a set removes duplicates on the fly, at no additional cost of CPU time or your thinking time. If you try that with a list it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
Solution 3
Using this program it looks like dicts are the fastes, set second, list with bi_contains third:
from datetime import datetime
def ReadWordList():
""" Loop through each line in english.txt and add it to the list in uppercase.
Returns:
Returns array with all the words in english.txt.
"""
l_words = []
with open(r'c:\english.txt', 'r') as f_in:
for line in f_in:
line = line.strip().upper()
l_words.append(line)
return l_words
# Loop through each line in english.txt and add it to the l_words list in uppercase.
l_words = ReadWordList()
l_words = {key: None for key in l_words}
#l_words = set(l_words)
#l_words = tuple(l_words)
t1 = datetime.now()
for i in range(10000):
#w = 'ZEBRA' in l_words
w = bi_contains(l_words, 'ZEBRA')
t2 = datetime.now()
print('After: ' + str(t2 - t1))
# list = 41.025293 seconds
# dict = 0.001488 seconds
# set = 0.001499 seconds
# tuple = 38.975805 seconds
# list with bi_contains = 0.014000 seconds
user229269
Updated on December 29, 2020Comments
-
user229269 over 3 years
-- I just parsed a big file and I created a list containing 42.000 strings/words. I want to query [against this list] to check if a given word/string belongs to it. So my question is:
What is the most efficient way for such a lookup?
A first approach is to sort the list (
list.sort()
) and then just use>> if word in list: print 'word'
which is really trivial and I am sure there is a better way to do it. My goal is to apply a fast lookup that finds whether a given string is in this list or not. If you have any ideas of another data structure, they are welcome. Yet, I want to avoid for now more sophisticated data-structures like Tries etc. I am interested in hearing ideas (or tricks) about fast lookups or any other python library methods that might do the search faster than the simple
in
.And also i want to know the index of the search item
-
user229269 about 14 yearsThanks a lot THC4k for your detailed response. Actually I was thinking to apply a binary search myself but as I see that is what the bisect module kind of does anyway, so you saved my time :). Again thanks for your help.
-
Mike Graham about 14 years@user229269, you latched on to the wrong part of the post! You probably want a
set
, not alist
at all. -
user229269 about 14 years@Mike Graham I know what you are saying, but I am afraid I might run into memory problems if I use sets, considering that my list is actually a fast growing word-list that is going to end up being as large as 100.000 strings and more
-
Mike Graham about 14 years@user229269, 100000 items isn't that many. Using a
set
instead of alist
for that many items should only increase memory usage by <2MB, which isn't really all that much on modern hardware. If your data did grow so large using aset
would cause memory problems, you'd probably want to look into using a very different technique, such as storing the data in a database. -
user229269 about 14 yearsYeah, actually you (@Mike Graham) are right :) -- I used sets already. Thanks a lot for making me reconsider it
-
Cometsong about 6 yearsSurprised by dicts being faster. Next question is how long does take to generate the 'l_words' objects. +1!
-
inverted_index over 2 yearsGreat approach! thanks, reduced lookup of over 1M entities from ~3 hours to a couple of seconds! :)