Python searching a large list speed
Solution 1
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file the most efficiently.
2) Change your code from
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lowercase 650,000 times. No need for that.
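A quick way to see the difference is a timing sketch like this (the sentence and word list are made up for illustration; the exact numbers will vary by machine):

```python
import timeit

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
words = ["quick", "brown", "fox"] * 2000  # stand-in for a large headword list

def lower_inside():
    # sentence.lower() runs once per word in the list
    return [w for w in words if w in sentence.lower()]

def lower_hoisted():
    # sentence.lower() runs once, before the loop
    s = sentence.lower()
    return [w for w in words if w in s]

assert lower_inside() == lower_hoisted()  # same result either way
print(timeit.timeit(lower_inside, number=20))
print(timeit.timeit(lower_hoisted, number=20))  # usually several times faster
```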
Solution 2
You should change headwordList into a set.

The test "word in headwordList" will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).

With a set, it always takes the same amount of time to do the "in" test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.
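A minimal sketch of the conversion (this tiny list stands in for the real 650,000-word list):

```python
headwordList = ["happy", "new", "year"]  # imagine 650,000 entries

# One-time conversion; afterwards, every membership test is a hash
# lookup whose cost does not grow with the number of headwords.
headwords = set(headwordList)

print("year" in headwords)   # True
print("yeer" in headwords)   # False
```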
Now, this whole loop can be simplified:
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
    if m.ratio() > percentage:
        percentage = m.ratio()
        word = x
if percentage > 0.86:
    sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords, as I want you to make it be a set and not a list.
def check_ratio(pair):
    # max() passes each (SequenceMatcher, word) tuple to this function
    return pair[0].ratio()

y = y.lower()  # do the .lower() call one time
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)
This might seem a bit tricky, but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that extracts what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the variables m and word.
In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method to append values.
Also, you might do better to build a dictionary of words and their ratios.
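For example, something along these lines (the 0.86 cutoff mirrors the code above; the sample words and variable names here are made up):

```python
from difflib import SequenceMatcher

headwords = {"happy", "new", "year"}
sentence_words = ["happpy", "zzz"]

# Map each unknown word to its best-matching headword and that ratio.
matches = {}
for w in sentence_words:
    ratio, headword = max(
        (SequenceMatcher(None, w.lower(), h).ratio(), h) for h in headwords)
    if ratio > 0.86:
        matches[w] = (headword, ratio)
print(matches)
```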
Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.
EDIT: This is to answer a question about generator expressions needing to be parenthesized.
Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:
def check_ratio(pair):
    # max() passes each (SequenceMatcher, word) tuple to this function
    return pair[0].ratio()

y = y.lower()  # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word = max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)
That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.
But I think it's easier to read if you split out the genexp on its own line.
EDIT: @Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.
Happily, the code got simpler! Always a good sign.
EDIT: I just tested the code. This code works for me; see if it works for you.
from difflib import SequenceMatcher

headwords = {
    # This is a set of 650,000 words
    # Dummy set:
    "happy",
    "new",
    "year",
}

def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word

def _match(matcher, s):
    matcher.set_seq1(s)
    return (matcher.ratio(), s)

ratios = {}
best_ratio = 0
matcher = SequenceMatcher()
for word in words_from_file("sentences.txt"):
    word = word.lower()
    if word not in headwords:
        # seq2 is the word we compare against every headword, so set it
        # once; only seq1 changes inside max()'s generator expression.
        matcher.set_seq2(word)
        ratio, headword = max(_match(matcher, h) for h in headwords)
        best_ratio = max(best_ratio, ratio)  # remember best ratio
        if ratio > 0.86:
            ratios[headword] = ratio
print(best_ratio)
print(ratios)
Solution 3
1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.
2) You have sentenceList defined as a list, then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.
3) You construct sentenceList, which doesn't need to be done:
for line in file:
    if line not in headwordList...
4) You never tokenize line, which means you store the entire line (everything before the newline character) in sentenceList and then check whether that whole line is in the word list.
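A small sketch of the difference (the headwords set here is a made-up stand-in):

```python
headwords = {"happy", "new", "year"}

line = "happy neew year\n"

# Storing the whole line: it can never equal a single headword.
print(line.strip() in headwords)          # False

# Tokenizing first lets each word be checked on its own.
for token in line.split():
    print(token, token in headwords)      # happy True / neew False / year True
```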
English Grad
Updated on June 05, 2022

Comments
-
English Grad almost 2 years
I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher

headWordList = [
    # This is a list of 650,000 words
]

openFile = open("sentences.txt", "r")

for line in openFile:
    sentenceList.append(line)

percentage = 0
count = 0

for y in sentenceList:
    if y not in headwordList:
        for x in headwordList:
            m = SequenceMatcher(None, y.lower(), x)
            if m.ratio() > percentage:
                percentage = m.ratio()
                word = x
        if percentage > 0.86:
            sentenceList[count] = word
    count = count + 1
Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.
-
steveha over 10 years This is not a useful answer to a Python question. Python includes several built-ins that can be usefully applied to this question (set and dict). @English Grad says that he/she is not an expert software engineer, so a page showing how to implement a data structure is not going to help, and even an expert software developer would be better served by using a Python built-in than by implementing a tree data structure.
-
English Grad over 10 years Steveha, I find this approach interesting and I am trying to use it, but I am getting an error message that says: "generator expression must be parenthesized if not sole argument" ...any ideas?
-
Peter Wood over 10 years Also, the docs suggest reusing a SequenceMatcher: 'SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.'
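A minimal sketch of that reuse pattern (toy data; the words are made up):

```python
from difflib import SequenceMatcher

headwords = ["happy", "new", "year"]

matcher = SequenceMatcher()
matcher.set_seq2("yeer")       # the sequence compared against many others

# Only seq1 changes in the loop, so seq2's cached analysis is reused.
scores = []
for h in headwords:
    matcher.set_seq1(h)
    scores.append((matcher.ratio(), h))
print(max(scores))             # (0.75, 'year') for this toy data
```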