Python searching a large list speed

10,449

Solution 1

Two things that might provide some small help:

1) Use the approach in this SO answer to read through your large file the most efficiently.

2) Change your code from

for x in headwordList:
    m = SequenceMatcher(None, y.lower(), 1)

to

yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, 1)

You're converting each sentence to lower 650,000 times. No need for that.

Solution 2

You should change headwordList into a set.

The test word in headwordList will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).

With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.

Now, this whole loop can be simplified:

     for x in headwordList:
         m = SequenceMatcher(None, y.lower(), x)

         if m.ratio() > percentage:
             percentage = m.ratio()

             word = x

     if percentage > 0.86:        
         sentenceList[count] = word

All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.

def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time
m, word =  max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    setence_list.append(word)

This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that tests what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the varaibles m and word.

In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/

It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.

Also, you might do better to build a dictionary of words and their ratios.

Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.

EDIT: This is to answer a question about generator expressions needing to be parenthesized.

Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:

def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word =  max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    setence_list.append(word)

That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:

m, word =  max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)

Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.

But I think it's easier to read if you split out the genexp on its own line.

EDIT: @Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.

Happily, the code got simpler! Always a good sign.

EDIT: I just tested the code. This code works for me; see if it works for you.

from difflib import SequenceMatcher

headwords = [
# This is a list of 650,000 words
# Dummy list:
    "happy",
    "new",
    "year",
]


def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word

def _match(matcher, s):
    matcher.set_seq2(s)
    return (matcher.ratio(), s)

ratios = {}
best_ratio = 0

matcher = SequenceMatcher()

for word in words_from_file("sentences.txt"):
    matcher.set_seq1(word.lower())
    if word not in headwords:
        ratio, word =  max(_match(matcher, word.lower()) for word in headwords)
        best_ratio = max(best_ratio, ratio)  # remember best ratio
        if ratio > 0.86:
            ratios[word] = ratio

print(best_ratio)
print(ratios)

Solution 3

1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.

2) You have sentenceList defined as a list then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.

3) You construct sentenceList which doesn't need to be done.

for line in file:
   if line not in headwordList...

4) You never tokenize line which means you store the entire line before the newline character in sentenceList and see if it is in a wordlist

Share:
10,449
English Grad
Author by

English Grad

Updated on June 05, 2022

Comments

  • English Grad
    English Grad almost 2 years

    I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:

    from difflib import SequenceMatcher
    headWordList = [ #This is a list of 650,000 words]
    
    
    openFile = open("sentences.txt","r")
    
    for line in openFile:
        sentenceList.append[line]
    
    percentage = 0
    count = 0
    
    for y in sentenceList:
          if y not in headwordList:
    
             for x in headwordList:
                 m = SequenceMatcher(None, y.lower(), x)
    
                 if m.ratio() > percentage:
                     percentage = m.ratio()
    
                     word = x
    
             if percentage > 0.86:        
                 sentenceList[count] = word
    count=count+1
    

    Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.

  • steveha
    steveha over 10 years
    This is not a useful answer to a Python question. Python includes several built-ins that can be usefully applied to this question (set and dict). @English Grad says that he/she is not an expert software engineer, so a page showing how to implement a data structure is not going to help, and even an expert software developer would be better served by using a Python built-in than by implementing a tree data structure.
  • English Grad
    English Grad over 10 years
    Steveha, I find this approach interesting and I am trying to use it but I am getting and error message that says: "generator expression must be parenthesized if not sole argument" ...any ideas?
  • Peter Wood
    Peter Wood over 10 years
    Also, the docs suggest reusing a SequenceMatcher: 'SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.'