Regex-matching a dictionary efficiently in Python

Solution 1

Compiling the regular expressions once, rather than each time you use them, would probably gain you a fair amount of performance. So you'd have something like:

import re

dicti = {'the': 20, 'a': 10, 'over': 2}
patterns = ['the', 'an?']

# Compile each pattern once, and bind its match method so the
# comprehension below skips the attribute lookup on every call.
regex_matches = [re.compile("^" + pattern + "$").match for pattern in patterns]

extractddicti = {k: v for k, v in dicti.items()
                 if any(regex_match(k) for regex_match in regex_matches)}
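Going one step further (an extension of the same idea, not part of the original answer), you could fold all the patterns into a single compiled alternation, so that each key is tested with one match call; a minimal sketch, using re.fullmatch (Python 3.4+) in place of the explicit anchors:

import re

dicti = {'the': 20, 'a': 10, 'over': 2}
patterns = ['the', 'an?']

# One regex for all patterns; fullmatch() requires the whole key to
# match, so the "^...$" wrappers are no longer needed.
combined = re.compile("|".join("(?:%s)" % p for p in patterns))
extractddicti = {k: v for k, v in dicti.items() if combined.fullmatch(k)}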

Solution 2

What about joining the dict keys into a single string to reduce the number of loops? It seems faster here:

import re, random, string

# Generate a random dictionary with 5-character keys
dicti = {''.join(random.choice(string.ascii_uppercase + string.digits)
                 for _ in range(5)): i for i in range(1000000)}

# Using ';' as the separator, [^;] matches any character inside a single key
regex_list = ['TH[^;]*', 'A[^;]*']
joined_keys = ';' + ';'.join(dicti.keys()) + ';'

# The lookarounds assert the separators without consuming them, so two
# matching keys that sit next to each other are both found
extractddicti = {i: dicti[i] for sublist in
                 [re.findall('(?<=;)' + k + '(?=;)', joined_keys) for k in regex_list]
                 for i in sublist}
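To see why the lookarounds matter (my illustration, not part of the original answer): with consuming separators, two adjacent matching keys share one ';', and findall silently skips every other one:

import re

print(re.findall(';A[^;]*;', ';AX;AY;'))           # [';AX;'] -- 'AY' is missed
print(re.findall('(?<=;)A[^;]*(?=;)', ';AX;AY;'))  # ['AX', 'AY']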

Timeit results for 10 loops:

╔════════════╦════════════════════╗
║ Algorithm  ║     Time (sec)     ║
╠════════════╬════════════════════╣
║ Mine       ║ 2.116118321594911  ║
║ E. Gordon  ║ 20.956314717412653 ║
╚════════════╩════════════════════╝
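The harness isn't shown in the original answer; here is a self-contained sketch of one way to reproduce such a figure (smaller dictionary so it runs quickly, and the numbers will of course differ by machine):

import re, random, string, timeit

dicti = {''.join(random.choice(string.ascii_uppercase + string.digits)
                 for _ in range(5)): i for i in range(100000)}
regex_list = ['TH[^;]*', 'A[^;]*']

def joined_solution():
    joined_keys = ';' + ';'.join(dicti.keys()) + ';'
    return {k: dicti[k] for pat in regex_list
            for k in re.findall('(?<=;)' + pat + '(?=;)', joined_keys)}

print(timeit.timeit(joined_solution, number=10))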

UPDATE

As @E. Gordon suggested, if you change the separator to a newline (\n) you can use the ^ and $ anchors together with re.MULTILINE.

regex_list = ['.*TH', 'AN.*']

joined_keys = '\n'.join(dicti.keys())
# Anchor every pattern and join them into one big alternation;
# with re.MULTILINE, ^ and $ match at every newline
all_regex = "^" + "$|^".join(regex_list) + "$"
matched_keys = re.findall(all_regex, joined_keys, re.MULTILINE)

dicti_match = {k: dicti[k] for k in matched_keys}
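One caveat worth adding (mine, not from the original answer): if a pattern can contain a top-level |, such as a|b, the join above attaches the anchors to the wrong alternatives. Wrapping each pattern in a non-capturing group avoids this:

# (?:...) binds the anchors to each whole pattern and, being
# non-capturing, keeps re.findall returning full matches
all_regex = "|".join("^(?:%s)$" % p for p in regex_list)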

Solution 3

Your problem looks very much like the ad-blocking problem. EasyList is similar to your regex list, but it contains URLs and URL patterns.

In general, regular expression matching is relatively fast. Yet, as you have already seen, it is not as fast as not running it at all, and common regular expression libraries are not optimized for large disjunctions like yours.

Your regex list also contains full words. A trie or, if memory is an issue, a DAWG would be perfect for matching the full-word portion of the list.
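A minimal sketch of that split, using a plain Python set as a stand-in for the trie/DAWG (both give exact full-word lookup); the metacharacter test is a rough heuristic of my own:

import re

dicti = {'the': 20, 'a': 10, 'over': 2}
regex_list = ['the', 'an?']

# Patterns without metacharacters are fixed words: exact lookup, no regex
META = set('.^$*+?{}[]\\|()')
fixed_words = {p for p in regex_list if not META & set(p)}
real_regexes = [re.compile('^(?:%s)$' % p)
                for p in regex_list if META & set(p)]

extracted = {k: v for k, v in dicti.items()
             if k in fixed_words or any(r.match(k) for r in real_regexes)}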

The remaining problem is matching a large input against a large set of regexes, and a pre-filtering approach works very well in such cases: first check whether there is a plausible match, and only if there is one, run the full regex match. For example, in order for the 'an?' regex to match, the string has to contain an 'a' (the 'n' is optional). Suffix trees are perfect for searching for such substrings; if your regexes are always full matches, you would probably want to add string-begin and string-end markers. Of course, you could also build a trie or DAWG of the set of fixed strings and start a new search at every character. The same effect can be implemented more efficiently as a deterministic finite-state automaton, using a powerful automaton library.
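As a concrete illustration of the pre-filtering idea (a simple required-prefix check standing in for a suffix tree; the prefix extraction is a rough sketch, not a complete regex parser):

import re

regex_list = ['an?', 'over.*']

prefiltered = []
for p in regex_list:
    # Literal prefix = the characters before the first regex metacharacter
    prefix = re.match(r'[^.^$*+?{}\[\]\\|()]*', p).group()
    # A following '?', '*' or '{' makes the last literal optional, so drop it
    if len(prefix) < len(p) and p[len(prefix)] in '?*{':
        prefix = prefix[:-1]
    prefiltered.append((prefix, re.compile('^(?:%s)$' % p)))

def matches(key):
    # Run the full regex only for keys that pass the cheap prefix test
    return any(key.startswith(pre) and rx.match(key)
               for pre, rx in prefiltered)

print(matches('a'), matches('overload'), matches('the'))  # True True False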

Hopefully, this method will eliminate the need to run regex matching on a large portion of your input. Where you do still have to run the regular expression match, you can go with the Python regex implementations in the other answers.

Comments

  • patrick
    patrick almost 2 years

    I have a dictionary of word:count pairs in my Python script. From this dictionary, I would like to extract the entries matching any item in a list of strings.

    I have found a working solution using regular expressions (see below), but it takes forever (~10 hours for a run). I feel there must be a faster way of doing this - do you guys have any input / ideas on how to improve my code?

    import re
    
    dicti={'the':20, 'a':10, 'over':2}
    regex_list=['the', 'an?'] 
    
    extractddicti = {k: v for k, v in dicti.items() if any(re.match("^" + regex + "$", k) for regex in regex_list)}
    

    In reality, the dictionary has about 60,000 entries and the regex_list has ~1,000 items. The items in the regex list are regex strings, i.e. they contain special characters like ?, parentheses such as (a|b|c), etc. They might match several entries in the dictionary.

    Update / Edit

    (see the accepted answer for an even better implementation of the same idea)

    Following the advice of Keozon and others, I first compiled my regexes like so:

    regex_list=['the', 'an?']
    regex_list_compiled=[re.compile("^"+i+"$") for i in regex_list]
    

    and then slightly adapted my search function:

    extractddicti = {k: v for k, v in dicti.items() if any(regex.match(k) for regex in regex_list_compiled)}
    

    The performance difference is quite breathtaking: a test run with a dictionary of 14,800 items and a list of 1,100 regexes took 34 minutes without compilation and slightly less than one (!) minute with compilation. I did not expect it to be so dramatic. Thanks for all your help!

  • SethMMorton
    SethMMorton almost 8 years
    Also, for a bit of extra speed, you can do regex_match = regex.match and then call regex_match inside the comprehension. It's typically a micro-optimization to do this, but if you have a large loop it can help.
  • emmagordon
    emmagordon almost 8 years
    If you join up the keys then you'll get erroneous matches - e.g. if you join up the strings 'th' and 'e', then you'd get a match against 'the'. Could be an interesting line of thought though - perhaps if you joined up the keys with newlines or some other appropriate delimiter?
  • emmagordon
    emmagordon almost 8 years
    Thanks @SethMMorton - good point, I've updated my answer with your suggestion.
  • patrick
    patrick almost 8 years
    @caiohamamura thanks for your input; I went with E. Gordon's answer here as it was up earlier and easier to implement, but will try yours later if I need more speed. Thanks!