Speed up millions of regex replacements in Python 3


Solution 1

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regexes, Eric's answer is faster.
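A minimal sketch of this single-pattern approach, using placeholder names banned_words and sentences (a non-capturing group and re.escape are used here, whereas the answer shows a plain group):

import re

banned_words = ["word1", "word2", "word3"]            # placeholder list
sentences = ["word1 is here.", "nothing to remove"]   # placeholder sentences

# One compiled pattern for all the words, so each sentence is scanned only once.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, banned_words)))

sentences = [pattern.sub("", sentence) for sentence in sentences]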

Solution 2

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use the set-based version (Solution 3 below), which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie and regex aren't really human-readable, but they do allow for very fast lookup and matching.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

For comparison, a naive regex union of these words would be r"\b(?:foobar|foobah|fooxar|foozap|fooza)\b".

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"


The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. The preprocessing is overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because parentheses are needed to delimit the alternation, while capturing groups would store text we never use, costing time and memory for no benefit.

Code

Here's a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            # Descend into the child node, creating it if necessary.
            ref = ref.setdefault(char, {})
        ref[''] = 1  # the empty key marks the end of a word

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            # Leaf node: a word ends here and nothing follows.
            return None

        alt = []  # multi-character alternatives
        cc = []   # single characters, merged into a character class
        q = 0     # set to 1 if a word ends at this node
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                recurse = self._pattern(data[char])
                if recurse is None:
                    cc.append(self.quote(char))
                else:
                    alt.append(self.quote(char) + recurse)
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            # A word ends here, so the rest of the pattern is optional.
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
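As a quick sanity check (hypothetical usage, with the Trie class above in scope), the five example words from earlier reproduce the pattern shown in the example:

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

print(trie.pattern())
# foo(?:ba[hr]|xar|zap?)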

Test

Here's a small test (the same test as the regex-union benchmark in Solution 3):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...

It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!
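To apply this to the OP's actual task (deleting banned words from sentences), the compiled trie regex can be passed to sub. A sketch, reusing trie_regex_from_words and banned_words from the test above, with a placeholder sentences list:

banned_regex = trie_regex_from_words(banned_words)

sentences = ["I'm eric. Welcome here!", "Another boring sentence."]  # placeholder
sentences = [banned_regex.sub("", sentence) for sentence in sentences]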

A diagram of the complete trie can be exported with trie-python-graphviz and graphviz twopi.

Solution 3

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use the trie-based version (Solution 2 above), which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you're done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

# Collect the results; re-assigning the loop variable alone would discard them.
sentences = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With Python 3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, ...) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye's answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn't


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.

Solution 4

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it's a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then "discarding" the result of this work before the next pass.
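One way this could look (a sketch only, with placeholder names banned_words and sentences; splitting with a capturing group keeps the separators so each sentence can be reassembled):

import re

banned_words = {"word1", "word2"}                  # placeholder set of banned words
sentences = ["word1 should go.", "nothing here"]   # placeholder sentences

splitter = re.compile(r"(\W+)")  # capturing group: separators are kept by re.split

cleaned = []
for sentence in sentences:
    tokens = splitter.split(sentence)
    # Keep every token that is not a banned word (separators are never banned).
    cleaned.append("".join(t for t in tokens if t.lower() not in banned_words))

print(cleaned)
# [' should go.', 'nothing here']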

Solution 5

Well, here's a quick and easy solution, with a test set.

Best strategy:

  • re.sub(r"\w+", repl, sentence) searches for words.
  • "repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
  • This is the simplest and fastest solution (see function replace4 in example code below).

Second best strategy:

  • The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
  • Implementation: (see function replace3 in example code below).

Timings for example functions:

replace1:     0.62 sentences/s
replace2:     7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240,000/s with PyPy)

...and code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)
        
        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)
    
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

EDIT: You can also ignore case when matching, by lowercasing the matched word before the dict lookup (the pattern keys are lowercase) and editing repl accordingly:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)


Comments

  • pdanese
    pdanese over 3 years

    I have two lists:

    • a list of about 750K "sentences" (long strings)
    • a list of about 20K "words" that I would like to delete from my 750K sentences

    So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

    I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

    compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
    

    Then I loop through my "sentences":

    import re
    
    for sentence in sentences:
      for word in compiled_words:
        sentence = re.sub(word, "", sentence)
      # put sentence into a growing list
    

    This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

    • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

    • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

    I'm using Python 3.5.2

    • pdanese
      pdanese about 7 years
      @MohammadAli No, I had not. I will look into it. If you have any suggested tutorial(s), please advise. Thank you
    • Mohammad Ali
      Mohammad Ali about 7 years
      The first answer here has some good sample code: stackoverflow.com/questions/2846653/… just divide up your sentence array by the number of CPU cores you have then run that many threads
    • pvg
      pvg about 7 years
      You can also try a non-regex implementation - traverse your input word by word and match every word against a set. This is a single pass and hash lookups are pretty quick.
    • pvg
      pvg about 7 years
      How long are these sentences, incidentally? 750k lines doesn't sound like a dataset that should be taking hours to process.
    • Antonín Lejsek
      Antonín Lejsek about 7 years
      You can even use regex for the non-regex solution suggested by pvg :-) Just regex-replace all "\S+" with a function that returns the same string when it is not in the set, or an empty string when it is in the set. My test (though not in Python) processes about 10M characters per second.
    • Kevin
      Kevin about 7 years
      @MohammadAli: Don't bother with that example for CPU-bound work. Python has a big lock that it takes when executing bytecode (the Global Interpreter Lock), so you can't benefit from threads for CPU work. You'd need to use multiprocessing (i.e. multiple Python processes).
    • Eric Duminil
      Eric Duminil about 7 years
      What do the sentences look like? Do they have punctuation? Is space the only delimiter between words? Replacing a word by an empty string will leave two spaces.
    • Eric Duminil
      Eric Duminil about 7 years
      @pvg: I've only read your comment after I posted an answer. Your suggestion is good, and the corresponding code appears to be much faster than the accepted solution.
    • pvg
      pvg about 7 years
      @EricDuminil ah, the match callback is a nice touch. It's definitely more explicit and clearer (to my eyes), as to faster, I didn't try it. What kind of time did you get with the concat'ed regexen in your example?
    • Eric Duminil
      Eric Duminil about 7 years
      @pvg: I didn't want to mention it explicitly, because I might have missed something. By using re.compile(r"\b(%s)\b" % '|'.join(words), re.IGNORECASE) with 100000 English words, the concatenated regexen took 160s for 10000 sentences. In comparison, my code took 7s for 100 times more sentences.
    • pvg
      pvg about 7 years
      @EricDuminil I think it's actually worth putting in the answer, along with the timings. The one thing I think is a little potentially confusing about it is the apparent suggestion the regex variant is a check against a list of words.
    • Eric Duminil
      Eric Duminil about 7 years
      @pvg If it turns out to be the correct timing, I sure would put it in my answer. I asked for confirmation from OP and Liteye, to be sure I don't disrespect a valid answer by saying it's very slow when it actually isn't. I'll update the answer, but "\b(word1|word2|word3)\b" is nothing more than a check against a list of words IMHO. In C, with a regexp, but the complexity still depends linearly on the number of words in the regex.
    • pvg
      pvg about 7 years
      @EricDuminil The compiled regex will short-circuit so it's not quite the same as linear list membership comparison. But that's a nit. As to the validity of the times, you don't have to claim your proposed solutions are faster for OP's specific dataset, just for the ones you tried. This question comes up a lot in various guises and could use more testable answers. The goal, after all, is 'usefulness to others' rather than simply 'usefulness to poster'.
    • Eric Duminil
      Eric Duminil about 7 years
      @pvg: For a word not included in (word1|word2|word3), regex would still need to check every word before saying there's no match, right?
    • smci
      smci about 7 years
      You're doing stopword removal aren't you? sklearn has vectorizers that do this with higher performance.
    • Amani Kilumanga
      Amani Kilumanga about 7 years
      This looks like it could be a good question for codereview.SE
    • Eric Duminil
      Eric Duminil about 7 years
      @user36476: Did you try my suggestion? I'd be happy to know how long it takes to process your data.
    • Admin
      Admin over 4 years
      You need an industrial-strength tool to do this. A regex trie is generated from a ternary tree of a list of strings. There are never more than 5 steps to failure, making this the fastest method to do this type of matching. Examples: 175,000 word dictionary or similar to your banned list just the 20,000 S-words
    • Admin
      Admin over 4 years
      Benchmark on just the S-words Completed iterations: 50 / 50 ( x 1 ) Matches found per iteration: 20050 Elapsed Time: 4.73 s, 4734.55 ms, 4734549 µs Matches per sec: 211,741 Also, see US City's Samples
    • Eric Duminil
      Eric Duminil about 4 years
      3 years later : thanks for your question. It allowed me to get 2 "great answer" badges :).
    • smci
      smci over 3 years
      This is a regex optimization question in disguise; your compiled_words is a horrifically inefficient list of 20K regexes (instead of one single regex). If words is known statically, you can optimize the regex even more.
  • Denziloe
    Denziloe about 7 years
    I think this is probably a more elegant implementation of my suggestion. I'm not 100% sure about the regex implementation but I think that will step through the string until it finds a boundary, and then check for each of the words at that point. So it only has to look for boundaries once. In addition I guess the C code might encode the words as a data structure with fast lookups such as a trie?
  • Liteye
    Liteye about 7 years
    I don't think it uses a trie, because there can be more complicated patterns. However, this definitely uses a loop in C instead of a loop in Python, which is good for performance.
  • Denziloe
    Denziloe about 7 years
    Good point, thanks. @user36476 please let us know how it performs with this code, I am interested.
  • pvg
    pvg about 7 years
    It's not just the C impl (which makes a big difference) but you're also matching with a single pass. Variants of this question come up pretty often, it's a little odd there isn't (or maybe there is, hiding somewhere?) a canonical SO answer for it with this pretty sensible idea.
  • pdanese
    pdanese about 7 years
    @Liteye I will give this a shot. Thank you.
  • pdanese
    pdanese about 7 years
    @Liteye your suggestion turned a 4-hour job into a 4-minute job! I was able to join all 20,000+ regexes into a single gigantic regex and my laptop didn't bat an eye. Thanks again.
  • Bakuriu
    Bakuriu about 7 years
    @user36476 Regexes are not just a shortcut for writing loops. They actually use some clever algorithms to search way faster than what you could possibly do by using an explicit loop with one pattern at a time.
  • user541686
    user541686 about 7 years
    @Bakuriu: s/They actually use/They actually could in theory sometimes use/. Do you have any reason to believe Python's implementation is doing anything other than a loop here?
  • Bakuriu
    Bakuriu about 7 years
    @Mehrdad Yes. Sure, it's still a backtracking engine so there are some cases where it takes a lot of time, but I'm pretty sure the simple cases like a disjunction of literal regexes should take O(n+m) time where n is the length of the text to be matched and m the length of the pattern. A plain naive loop is O(n·m). The regex the OP uses shouldn't need any backtracking, hence I expect the regex to take linear time.
  • pdanese
    pdanese about 7 years
    You were correct. My indentation was wrong. I fixed it in the original question. As for the comment that 50 sentences / second is slow, all I can say is that I am providing a simplified example. The real data-set is more complicated than I am describing, but it didn't seem relevant. Also, concatenation of my "words" into a single regex massively improved the speed. Also, I am "squeezing" out double-spaces after the replacements.
  • Eric Duminil
    Eric Duminil about 7 years
    @user36476 Thanks for the feedback, I removed the corresponding part. Could you please try my suggestion? I dare say it's much faster than the accepted answer.
  • Eric Duminil
    Eric Duminil about 7 years
    @user36476: I've seen your comment update now, and changed the intro.
  • pvg
    pvg about 7 years
    Python 3 is a given in the question. I upvoted this but I think it might be worth sacrificing some of the detail and 'optimal' implementation in this code to make it less verbose.
  • Eric Duminil
    Eric Duminil about 7 years
    If I understand it correctly, it's basically the same principle as my answer, but more verbose? Splitting and joining on \W+ is basically like sub on \w+, right?
  • idmean
    idmean about 7 years
    Since you removed that misleading O(1) claim, your answer definitely deserves an up vote.
  • Eric Duminil
    Eric Duminil about 7 years
    @Bakuriu: I'd be really interested to know if that's the case, but I don't think the regex solution takes linear time. If it doesn't build a Trie out of the union, I don't see how it could happen.
  • Eric Duminil
    Eric Duminil about 7 years
    @Liteye: Thanks for the link to my answer, that's very fair of you!
  • Eric Duminil
    Eric Duminil about 7 years
    @idmean: True, that wasn't very clear. It was just referring to the lookup : "Is this word a banned word?".
  • bobflux
    bobflux about 7 years
    I wonder if my solution below (function replace4) is faster than pypy ;) I'd like to test on your files!
  • Eric Duminil
    Eric Duminil about 7 years
    Upvote for the tests. replace4 and my code have similar performances.
  • user541686
    user541686 about 7 years
    @Bakuriu: That's not a reason. I was asking if you have a reason to believe the implementation actually behaves that way, not whether you have a reason to believe it could behave that way. Personally I have yet to come across a single mainstream programming language's regex implementation that works in linear time the same way you'd expect a classical regex to, so if you know Python does this, you should show some evidence.
  • Matthieu M.
    Matthieu M. about 7 years
    @EricDuminil: Liteye's solution is only O(n*m) if the regular expression engine implements the look-up of the word as O(m). It hopefully optimizes it, though it may not end up being as efficient as a look-up table.
  • Eric Duminil
    Eric Duminil about 7 years
    @MatthieuM. I agree on the "hopefully". I searched for a while, I couldn't find any indication that "word1|word2|word3" is optimized to "word[123]" for example.
  • Matthieu M.
    Matthieu M. about 7 years
    @EricDuminil: It's the kind of things that's really internal and hardly documented unfortunately, so short of reading the code or asking someone who did I am afraid we'll never know :(
  • Eric Duminil
    Eric Duminil about 7 years
    @MatthieuM. We can guess, though. I'll do some tests with growing regexen, and I'll try to build a regex trie out of the union to see if it speeds things up.
  • Eric Duminil
    Eric Duminil about 7 years
    @MatthieuM. See my update. Regexp union lookup does appear to be O(m)
  • Matthieu M.
    Matthieu M. about 7 years
    @EricDuminil: Great work! Wish I could upvote a second time.
  • Danny_ds
    Danny_ds about 7 years
    In case the sentences are (were) stored in a text file, they are already separated by a newline. So the whole file could be read in as one big string (or buffer), words removed, and then written back again (or this could be done in the file directly using memory mapping). Otoh, to remove a word, the remainder of the string has to be moved back to fill the gap, so that would be a problem with one very large string. An alternative would be to write the parts between the words back to another string or file (which would include the newlines) – or just move those parts in a mmapped file (1) ..
  • Danny_ds
    Danny_ds about 7 years
    .. That last approach (moving/writing the parts between the words) combined with Eric Duminil’s set lookup could be really fast, perhaps without even using regex at all. (2)
  • Danny_ds
    Danny_ds about 7 years
    .. Or maybe regex is already optimized to only move those parts when replacing multiple words, I don't know.
  • Eric Duminil
    Eric Duminil about 7 years
    Just for fun, I wrote an answer with a Trie Regexp. It's approximately 1000 times faster than a regexp union.
  • Xavier Combelle
    Xavier Combelle about 7 years
    It looks like, for the original purpose, there is no need for a non-capturing group. At least the meaning of the non-capturing group should be mentioned.
  • Eric Duminil
    Eric Duminil about 7 years
    @XavierCombelle: You're right that I should mention the capturing group : the answer has been updated. I see it the other way around though : parens are needed for regex alternation with | but capturing groups aren't needed for our purpose at all. They'd just slow down the process and use more memory without benefit.
  • Eric Duminil
    Eric Duminil about 7 years
    @pvg: What do you mean with "single pass" exactly?
  • StatguyUser
    StatguyUser almost 7 years
    Not sure what def repl(m): is doing and how you are assigning m in the function replace4
  • StatguyUser
    StatguyUser almost 7 years
    Also I am getting error: unbalanced parenthesis for the line patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]
  • Eric Duminil
    Eric Duminil over 6 years
    @CasimiretHippolyte: Thank you very much, I'm quite proud of these 2 solutions!
  • Mohamed AL ANI
    Mohamed AL ANI about 6 years
    @EricDuminil This post is perfect, thank you so much :)
  • Eric Duminil
    Eric Duminil about 6 years
    @MohamedALANI: My pleasure! It was a thrill to write the code and see it perform so well. Please take a look at my other answer, which is even faster.
  • Mohamed AL ANI
    Mohamed AL ANI about 6 years
    @EricDuminil Thanks! What I'm actually trying to do is to find all words from a list (200,000 words and groups of words, example: ["elephant", "black elephant", "tiger running in field"]) that occur in a text sample. I used your trie with re.findall(trie.pattern(), text) but it doesn't seem to speed up my code :( Any idea?
  • Eric Duminil
    Eric Duminil about 6 years
    @MohamedALANI: Compared to which solution?
  • Zoltan Fedor
    Zoltan Fedor over 5 years
    While the replace3 and replace4 functions address the original issue (replacing words), replace1 and replace2 are more general-purpose, as they work even if the needle is a phrase (a sequence of words) and not just a single word.
  • Hyrial
    Hyrial over 5 years
    How would one also add banned prefixes to the trie? For example, any word that starts with the prefix 'bull' would be matched by regex.
  • tharndt
    tharndt over 4 years
    Great job on the code snippets! I applied the regex union and the trie on processing ~50 million text segments with 70 banned words. The code with the regex union runs with ~7500 it/s; the trie outdoes it with ~12500 it/s
  • Eric Duminil
    Eric Duminil over 4 years
    @tharndt good to know, thanks. The relative difference would be even higher with more banned words.
  • Radio Controlled
    Radio Controlled over 4 years
    I guess it is building a precompiled Finite State Machine/Transducer in C. That should be fast to run. It only takes as many steps as the input string is long.
  • PV8
    PV8 over 4 years
    One question: the union will only target exact matching of the original word list, so if I have ['Apple', 'Banana'] and I use a replace afterwards, it will only hit: apple, banana and not app, apple, ... right?
  • Eric Duminil
    Eric Duminil over 4 years
    @PV8: It should only match complete words, yes, thanks to the \b (word boundary). If the list is ['apple', 'banana'], it will replace words which are exactly apple or banana, but not nana, bana or pineapple.
  • Admin
    Admin over 4 years
    Do you have a list of words I can pump into this tool to see how it matches up with what you get and to do a benchmark comparison using your output? If you could post a text of it somewhere I would appreciate it.
  • Eric Duminil
    Eric Duminil over 4 years
    @xt15: That tool looks awesome, thanks for the tip. The dictionary I used comes from Debian packages. Here's the list, which I just copied from /usr/share/dict/american-english
  • elexhobby
    elexhobby about 4 years
    Will this work if the words in banned_words themselves contain regular expressions?
  • Eric Duminil
    Eric Duminil about 4 years
    @elexhobby: I don't know, I'll try.
  • Eric Duminil
    Eric Duminil about 4 years
    @elexhobby: I just tried. It doesn't work yet, because re.escape is called on every input, so for example '.' is converted to '\.'. You can try to remove it but it might lead to weird behavior, especially with complex regexen. It might work with very basic examples, for example 'foo...' and '...bar'. A more complex logic would be required for anything else.
  • Wiktor Stribiżew
    Wiktor Stribiżew over 2 years
    This answer should not be accepted. Eric's regex trie answer must be accepted.