Mass string replace in python?

21,458

Solution 1

mydict = {"&y":"\033[0;30m",
          "&c":"\033[0;31m",
          "&b":"\033[0;32m",
          "&Y":"\033[0;33m",
          "&u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

for k, v in mydict.iteritems():
    mystr = mystr.replace(k, v)

print mystr
The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog

I took the liberty of comparing a few solutions:

mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print 'Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr))

# My solution
t = time()
for x in range(rep):
    for k, v in mydict.items():
        mystr.replace(k, v)
    #print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t

# Claudiu
def multiple_replace(dict, text): 
    # Create a regular expression  from the dictionary keys
    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

t = time()
for x in range(rep):
    multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t

# Andrew Y - fixed
def repl(s):
  return mydict["&"+s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t

Results in Python 2.6

Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict      1.04699993134
Peter fixed & variable dict    0.218999862671
Claudio variable dict          2.48400020599
Claudio fixed dict             0.0940001010895
Andrew Y variable dict         0.0309998989105
Andrew Y fixed dict            0.0310001373291

Both claudiu's and andrew's solutions kept going into 0, so I had to increase it to 10 000 runs.

I ran it in Python 3 (because of unicode) with replacements of chars from 39 to 1024 (38 is ampersand, so I didn't wanna include it). String length up to 10.000 including about 980 replacements with variable random inserts of length 0-20. The unicode values from 39 to 1024 causes characters of both 1 and 2 bytes length, which could affect some solutions.

mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr)))

# Tor Valamo - too long
#t = time()
#for x in range(rep):
#    for k, v in mydict.items():
#        mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)

# Claudiu - too long
#def multiple_replace(dict, text): 
#    # Create a regular expression  from the dictionary keys
#    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
#    # For each match, look-up corresponding value in dictionary
#    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
#    multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)

# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

def mysubst2(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)
t = time()
for x in range(rep):
    mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)

# Andrew Y - fixed
def repl(s):
  return mydict[s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)

# gnibbler
t = time()
for x in range(rep):
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    "".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)

Results:

Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict      0.0 # disqualified 329 secs
Peter fixed & variable dict    2.07799983025
Peter fixed dict               1.53100013733 
Claudio variable dict          0.0 # disqualified, 37 secs
Claudio fixed dict             1.5
Andrew Y variable dict         0.578000068665
Andrew Y variable dict 2       0.56299996376
Andrew Y fixed dict            0.56200003624
gnibbler fixed & variable dict 0.530999898911

(** Note that gnibbler's code uses a different dict, where keys don't have the '&' included. Andrew's code also uses this alternate dict, but it didn't make much of a difference, maybe just 0.01x speedup.)

Solution 2

Try this, making use of regular expression substitution, and standard string formatting:

# using your stated values for str and dict:
>>> import re
>>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str)
>>> str % dict
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'

The re.sub() call replaces all sequences of ampersand followed by single letter with the pattern %(..)s containing the same pattern.

The % formatting takes advantage of a feature of string formatting that can take a dictionary to specify the substitution, rather than the more commonly occurring positional arguments.

An alternative can do this directly in the re.sub, using a callback:

>>> import re
>>> def dictsub(m):
>>>    return dict[m.group()]
>>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)

This time I'm using a closure to reference the dictionary from inside the callback function. This approach could give you a little more flexibility. For example, you could use something like dict.get(m.group(), '??') to avoid raising exceptions if you had strings with unrecognized code sequences.

(By the way, both "dict" and "str" are builtin functions, and you'll get into trouble if you use those names in your own code much. Just in case you didn't know that. They're fine for a question like this of course.)

Edit: I decided to check Tor's test code, and concluded that it's nowhere near representative, and in fact buggy. The string generated doesn't even have ampersands in it (!). The revised code below generates a representative dictionary and string, similar to the OP's example inputs.

I also wanted to verify that each algorithm's output was the same. Below is a revised test program, with only Tor's, mine, and Claudiu's code -- because the others were breaking on the sample input. (I think they're all brittle unless the dictionary maps basically all possible ampersand sequences, which Tor's test code was doing.) This one properly seeds the random number generator so each run is the same. Finally, I added a minor variation using a generator which avoids some function call overhead, for a minor performance improvement.

from time import time
import string
import random
import re

random.seed(1919096)  # ensure consistent runs

# build dictionary with 40 mappings, representative of original question
mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40))
# build simulated input, with mix of text, spaces, ampersands in reasonable proportions
letters = string.letters + ' ' * 12 + '&' * 6
mystr = ''.join(random.choice(letters) for i in range(1000))

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and %d ampersands'
    % (rep, len(mystr), mystr.count('&')))

# Tor Valamo
# fixed from Tor's test, so it actually builds up the final string properly
t = time()
for x in range(rep):
    output = mystr
    for k, v in mydict.items():
        output = output.replace(k, v)
print('%-30s' % 'Tor fixed & variable dict', time() - t)
# capture "known good" output as expected, to verify others
expected = output

# Peter Hansen

# build charset to use in regex for safe dict lookup
charset = ''.join(x[1] for x in mydict.keys())
# grab reference to method on regex, for speed
patsub = re.compile(r'(&[%s])' % charset).sub

t = time()
for x in range(rep):
    output = patsub(r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
assert output == expected

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time() - t)
assert output == expected

# Peter 3 - freaky generator version, to avoid function call overhead
def dictsub(d):
    m = yield None
    while 1:
        m = yield d[m.group()]

dictsub = dictsub(mydict).send
dictsub(None)   # "prime" it
t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter generator', time() - t)
assert output == expected

# Claudiu - Precompiled
regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub

t = time()
for x in range(rep):
    output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time() - t)
assert output == expected

I forgot to include benchmark results before:

    Running 10000 times with string length 1000 and 96 ampersands
    ('Tor fixed & variable dict     ', 2.9890000820159912)
    ('Peter fixed & variable dict   ', 2.6659998893737793)
    ('Peter fixed dict              ', 1.0920000076293945)
    ('Peter generator               ', 1.0460000038146973)
    ('Claudio fixed dict            ', 1.562000036239624)

Also, snippets of the inputs and correct output:

mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...}
output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'

Comparing with what I saw from Tor's test code output:

mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
output = # same as mystr since there were no ampersands inside

Solution 3

If you really want to dig into the topic take a look at this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

The obvious solution by iterating over the dictionary and replacing each element in the string takes O(n*m) time, where n is the size of the dictionary, m is the length of the string.

Whereas the Aho-Corasick-Algorithm finds all entries of the dictionary in O(n+m+f) where f is the number of found elements.

Solution 4

If the number of keys in the list is large, and the number of the occurences in the string is low (and mostly zero), then you could iterate over the occurences of the ampersands in the string, and use the dictionary keyed by the first character of the substrings. I don't code often in python so the style might be a bit off, but here is my take at it:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

dict = {"&y":"\033[0;30m",
        "&c":"\033[0;31m",
        "&b":"\033[0;32m",
        "&Y":"\033[0;33m",
        "&u":"\033[0;34m"}

def rep(s):
  return dict["&"+s[0:1]] + s[1:]

subs = str.split("&")
res = subs[0] + "".join(map(rep, subs[1:]))

print res

Of course there is a question what happens when there is an ampersand that is coming from the string itself, you would need to escape it in some way before feeding through this process, and then unescape after this process.

Of course, as is pretty much usual with the performance issues, timing the various approaches on your typical (and also worst-case) dataset and comparing them is a good thing to do.

EDIT: place it into a separate function to work with arbitrary dictionary:

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

EDIT2: get rid of an unneeded concatenation, seems to still be a bit faster than the previous on many iterations.

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

Solution 5

Here is the C Extensions Approach for python

const char *dvals[]={
    //"0-64
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","",
    //A-Z
    "","","","","",
    "","","","","",
    "","","","","",
    "","","","","",
    "","","","","33",
    "",
    //
    "","","","","","",
    //a-z
    "","32","31","","",
    "","","","","",
    "","","","","",
    "","","","","",
    "34","","","","30",
    ""
};

int dsub(char*d,char*s){
    char *ofs=d;
    do{
        if(*s=='&' && s[1]<='z' && *dvals[s[1]]){

            //\033[0;
            *d++='\\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';';

            //consider as fixed 2 digits
            *d++=dvals[s[1]][0];
            *d++=dvals[s[1]][1];

            *d++='m';

            s++; //skip

        //non &,invalid, unused (&) ampersand sequences will go here.
        }else *d++=*s;

    }while(*s++);

    return d-ofs-1;
}

Python codes I have tested

from mylib import *
import time

start=time.time()

instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.\n"*100000
x=dsub(instr)

end=time.time()

print "time taken",end-start,",input str length",len(x)
print "first few lines"
print x[:1100]

Results

time taken 0.140000104904 ,input str length 11000000
first few lines
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.

Its suppose to able to run at O(n), and Only took 160 ms (avg) for 11 MB string in My Mobile Celeron 1.6 GHz PC

It will also skip unknown characters as is, for example &Unknown will return as is

Let me know If you have any problem with compiling, bugs, etc...

Share:
21,458

Related videos on Youtube

Mike Trpcic
Author by

Mike Trpcic

I am a software developer located in San Francisco specializing in Web Applications and Business Solutions. I am currently focused on using Ruby on Rails to build dynamic web applications, but have experience with a multitude of other technologies. Feel free to browse my website for more information.

Updated on April 01, 2020

Comments

  • Mike Trpcic
    Mike Trpcic about 4 years

    Say I have a string that looks like this:

    str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
    

    You'll notice a lot of locations in the string where there is an ampersand, followed by a character (such as "&y" and "&c"). I need to replace these characters with an appropriate value that I have in a dictionary, like so:

    dict = {"&y":"\033[0;30m",
            "&c":"\033[0;31m",
            "&b":"\033[0;32m",
            "&Y":"\033[0;33m",
            "&u":"\033[0;34m"}
    

    What is the fastest way to do this? I could manually find all the ampersands, then loop through the dictionary to change them, but that seems slow. Doing a bunch of regex replaces seems slow as well (I will have a dictionary of about 30-40 pairs in my actual code).

    Any suggestions are appreciated, thanks.

    Edit:

    As has been pointed out in comments throught this question, my dictionary is defined before runtime, and will never change during the course of the applications life cycle. It is a list of ANSI escape sequences, and will have about 40 items in it. My average string length to compare against will be about 500 characters, but there will be ones that are up to 5000 characters (although, these will be rare). I am also using Python 2.6 currently.

    Edit #2 I accepted Tor Valamos answer as the correct one, as it not only gave a valid solution (although it wasn't the best solution), but took all others into account and did a tremendous amount of work to compare all of them. That answer is one of the best, most helpful answers I have ever come across on StackOverflow. Kudos to you.

    • Peter Hansen
      Peter Hansen over 14 years
      As Tor Valamo points out, you might also want to consider error conditions -- such as if you have ampersand sequences that aren't in your dictionary -- and the case where you have an ampersand in the string that should be left alone as it's part of the textual content.
    • Peter Hansen
      Peter Hansen over 14 years
      Mike, in addition to knowing the overall string length, it would be important to know the density of escape sequences, or total per string, or something, for full benchmarking.
    • Mike Trpcic
      Mike Trpcic over 14 years
      Peter: That's not predictable, as some strings will have 15 characters with 15 escape sequences, and some will have 500 characters with 1 escape sequence. The strings are coming from the user, and as such, can be anything they want. For benchmarking, I would assume one escape sequence per 25 regular characters.
    • Tor Valamo
      Tor Valamo over 14 years
      If the strings are coming from the user I'd say error handling is kinda nice, eh Peter? :P
    • Peter Hansen
      Peter Hansen over 14 years
      @Tor, of course if error handling is now a requirement, then one provides it. It hasn't been defined what you'd want to do in the case of input text containing, for example "A&W root beer", if "&W" was also an escape code.
    • Peter Hansen
      Peter Hansen over 14 years
      By the way, Tor's "error handling" (not barfing on codes missing from dictionary) is trivially handled with the regex solutions simply by ensuring the regex specifies only valid codes instead of [a-zA-Z]. If "Q" is not a valid code in the dict, just leave it out of the character set: [a-zA-PR-Z]. For that matter, in a real solution you'd just build the regex dynamically based on the dictionary.
    • Peter Hansen
      Peter Hansen over 14 years
      @Mike, please see my revised answer. Possibly the "faster" alternatives will work if you can tweak them, but with a test program that actually matches your input samples (which I don't believe Tor's does), those others didn't seem to work. My solution actually appears fastest after all, at least at the moment. Solutions to those others are welcome, or someone could point out why they fail in my test.
  • Tor Valamo
    Tor Valamo over 14 years
    actually it doesn't suffer from the string length it suffers from the dict length.
  • Claudiu
    Claudiu over 14 years
    already modified. not sure if this code is faster than doing the loop itself; it does have an extra function call for each replacement. will have to time it for that.
  • Tor Valamo
    Tor Valamo over 14 years
    This has a problem though... if the string contains a match that isn't in the dictionary...
  • charstar
    charstar over 14 years
    Yikes, this would become terribly expensive for any large dictionary and large text.
  • Peter Hansen
    Peter Hansen over 14 years
    The OP didn't specify that he required bullet-proofing. He might say "it's guaranteed that all sequences found in the string are in the dictionary". If every answer without perfect error handling were deleted from StackOverflow, there'd be only a handful left...
  • Tor Valamo
    Tor Valamo over 14 years
    It's not just about error handling, but the fact that in this will fail completely on the smallest error. I see your second alternative handles this though, with the default return value.
  • Mike Trpcic
    Mike Trpcic over 14 years
    My dictionary will have about 40 entries in it, and most of my strings will be under 500 characters. How expensive would this be compared to a looping str.replace() or the suggestion by Peter Hanson?
  • Peter Hansen
    Peter Hansen over 14 years
    Sometimes I very much want code that "fails completely on the smallest error". If it did not, I wouldn't find the problem in the other part of my program which was inserting the ampserand escape sequences in the first place. Of course, my unit tests for that other piece tell me it generates only those patterns covered by the dictionary shown, so I know I don't need the redundant error handling that you're talking about adding to my nice clean program, burdening me with extra maintenance overhead. (As you can see, there are times when some people would consider error handling unnecessary.)
  • John La Rooy
    John La Rooy over 14 years
    This is straightforward and probably faster than regex unless the real replacement dict is much larger than 5 elements
  • Mike Trpcic
    Mike Trpcic over 14 years
    gnibbler: My actual dict is going to be about 40 elements.
  • John La Rooy
    John La Rooy over 14 years
    @Mike, you would have to test to be sure, but regex really are a lot slower than simple replace. I've posted an answer that uses split/join will be interesting to see which approach is better under various conditions
  • John La Rooy
    John La Rooy over 14 years
    This will only work if there is always a space before the ampersand
  • Bon Pham
    Bon Pham over 14 years
    Interesting... the reason I thought the string length would be more important was because it only loops through the dictionary once but it searches through the string repeatedly. I understand that both will impact the speed, but why does it suffer more from the dict length? (not questioning that you're right, just wondering why?)
  • Peter Hansen
    Peter Hansen over 14 years
    You're not being very fair to Claudiu. First, you're calling his as a function, and function call overhead isn't ignorable in Python. Second, his compilation step would not be done each time, but once at program startup.
  • Peter Hansen
    Peter Hansen over 14 years
    @Tor, you're having enough fun with this: make sure you upvote Mike's question! :-)
  • Tor Valamo
    Tor Valamo over 14 years
    Because you call the replace once per dict item, so the more dict items the more calls. If the string is longer it wouldn't affect it as much. But it doesn't matter too much anyways, if you see my answer with the benchmarks. :P
  • Tor Valamo
    Tor Valamo over 14 years
    @Peter - You're assuming the dictionary is the same for each call, which it may not be. Therefore I'm compiling it for each run. But yeah I'll add a notice about that being the cause of it.
  • Peter Hansen
    Peter Hansen over 14 years
    @Tor, if you look at what the code is doing, it's inserting ANSI escape sequences for printing things like bold, underline, etc. Those codes are almost certainly pre-defined and unchanging for the OP, so I think it's a good bet you would have a static dict and compile at startup. I could be wrong.
  • Claudiu
    Claudiu over 14 years
    If you run my answer correctly: regex = compile("(%s)" % "|".join(map(escape, mydict.keys()))) t = time() for x in range(rep): regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print(time()-t) Then, for 10000 runs, the numbers I get are: 0.375 1.28100013733 0.593999862671 So yours is still faster, but mine is much closer.
  • Tor Valamo
    Tor Valamo over 14 years
    @Andrew - I timed yours to half the speed of mine, but yours is assuming the the dictionary is consistent, which may not always be the case. But if it is, you're winning.
  • Mike Trpcic
    Mike Trpcic over 14 years
    Peter is correct. I will have a predfined dictionary of ANSI escape sequences (for foreground, background, underlining, etc). This list will never change during runtime, so one compile is perfectly sufficient.
  • Tor Valamo
    Tor Valamo over 14 years
    @Claudiu - Then your also assuming that the dict is constant, which is ok, but I'll have to differentiate between that.
  • Mike Trpcic
    Mike Trpcic over 14 years
    I would love to see an updated benchmark with these new revelations taken into account, as well as some more of the alternatives.
  • Andrew Y
    Andrew Y over 14 years
    @Tor: actuallymy initial speedup of 40x was due to a stupid typo, with revised test my runtime is about 2/3 of yours - though I did uncomment the print(mystr) statement in your example - and it prints the original string ? I do not code python much, so, not sure if it's me doing something wrong or indeed there is a bug somewhere
  • Tor Valamo
    Tor Valamo over 14 years
    I updated the post with new benchmarks including Andrew's, and differing between fixed and variable dicts.
  • Mike Trpcic
    Mike Trpcic over 14 years
    Tor you are truly a master of your craft. I applaud your dedication to helping me with this. Thank you.
  • Peter Hansen
    Peter Hansen over 14 years
    @Tor, Mike has said in another comment that his strings may be quite a bit larger ("most... under 500"). Could you try a larger string too, maybe like what you have but with a dozen or so non-code characters between each pair of codes? This may differentiate the solutions more (yours probably getting the advantage).
  • Mike Trpcic
    Mike Trpcic over 14 years
    If you read the edits to my post, I've mentioned that most strings will be under 500 characters, but there will be some that are as large as 5000.
  • Andrew Y
    Andrew Y over 14 years
    @Tor: I've added the variant which wraps the same logic into the function that takes the string to be processed and dictionary as a parameter. Somewhere in the middle between my previous result and yours, it seems.
  • Tor Valamo
    Tor Valamo over 14 years
    I updated with new benchmarks of strings with randoms inserts. Dramatic change in results!
  • Tor Valamo
    Tor Valamo over 14 years
    Also updated with Andrew's latest, who now beats everybody both on variable and fixed dicts.
  • Mike Trpcic
    Mike Trpcic over 14 years
    It seems that Andrew Y's solution is consistently at the top of the list.
  • Andrew Y
    Andrew Y over 14 years
    Your code barfs unless I replace mydict[x[0]] with mydict["&" + x[0]] - when I do, it is a tiny bit faster than my first approach.
  • visual_learner
    visual_learner over 14 years
    I would be using lambda m: dict[m.group()] for this (but I would also abstract this functionality into it's own function).
  • Andrew Y
    Andrew Y over 14 years
    @Mike: we should test mine on long strings with a lot of characters to be replaced. I wonder if it will become more problematic then.
  • Bon Pham
    Bon Pham over 14 years
    Right, I was saying compared to other methods it would suffer due to the string length, because every method would have to loop through the entire dictionary, but not every method would have to search through the string repeatedly. However, you're right that it doesn't really matter, just curious. :-p
  • Tor Valamo
    Tor Valamo over 14 years
    @Andrew, @Peter, @Claudiu - Now tested with A LOT of inserts on LONG strings of unicode in python 3.
  • Mike Trpcic
    Mike Trpcic over 14 years
    Andrew Y is still kickin' it on top.
  • Andrew Y
    Andrew Y over 14 years
    @Tor: I removed the extra concatenation in the "variable" example - see my edit2. This should shave off one more tiny bit, according to my tests :-)
  • Tor Valamo
    Tor Valamo over 14 years
    Peter's alternate solution was tested (fixed dict) and slightly slower (0.01 times) than Claudiu's fixed dict solution.
  • Tor Valamo
    Tor Valamo over 14 years
    @Andrew, yes, slightly faster, however you're so far on top I'm done editing my post for today :P
  • Andrew Y
    Andrew Y over 14 years
    @Tor: indeed, time to sleep for me as well. Thanks for oding this, it was fun! :)
  • Tor Valamo
    Tor Valamo over 14 years
    I added it just for completeness, but just the results, not the code. My own solution in this test was so slow that I'm still laughing.
  • ghostdog74
    ghostdog74 over 14 years
    i don't want to assume too much. since OP provided samples, i will work with that sample.
  • John La Rooy
    John La Rooy over 14 years
    @Andrew, I suspect you are using a version of mydict with '&' in front of the keys. Mine doesn't have those
  • John La Rooy
    John La Rooy over 14 years
    @Tor, can you add my answer to your benchmark? stackoverflow.com/questions/1919096/…. Note that my mydict doesn't have the & in front of the keys
  • Tor Valamo
    Tor Valamo over 14 years
    Yours assumes that every & is followed by a replacement, which will quickly crash once a character that isn't in the dict suddenly appears.
  • John La Rooy
    John La Rooy over 14 years
    @Andrew, you can use single letters for the keys as I did in my answer as the & is implied by the split. That saves doing the "&"+.. for each element
  • Tor Valamo
    Tor Valamo over 14 years
    @gnibbler, yours is now the fastest.
  • Tor Valamo
    Tor Valamo over 14 years
    I altered this code to run with non-& dict, and it didn't make much difference. gnibbler's is still faster.
  • Tor Valamo
    Tor Valamo over 14 years
    Can you benchmark it using my test? It would seem that if you wanted to change the dictionary you'd have to do a lot of work...
  • Andrew Y
    Andrew Y over 14 years
    @gnibbler: ah yes. I used the original data. Sorry, indeed.
  • Tor Valamo
    Tor Valamo over 14 years
    I see a bug, it's not replacing the character, only the ampersand.
  • YOU
    YOU over 14 years
    Could you tell me which part of the code? *d++=dvals[s[1]][0];*d++=dvals[s[1]][1]; supposed to do that replacing actually.
  • Tor Valamo
    Tor Valamo over 14 years
    +1. For this current problem it seems a bit like overkill for an occasional string replacement though. :P
  • Tor Valamo
    Tor Valamo over 14 years
    The &yquick -> The \033[0;30myquick. That y shouldn't be there.
  • YOU
    YOU over 14 years
    for dictionary changes, only need to update dvals, and recompile it, only the compiling will be extra step.
  • Claudiu
    Claudiu over 14 years
    what's the disqualified stuff for?
  • Andrew Y
    Andrew Y over 14 years
    @Tor: I doublechecked - and am I right that in the latest test code there are no ampersands at all ? then gnibbler and my code would win, indeed. But we should debug the test suite a bit better tomorrow, imho.
  • Tor Valamo
    Tor Valamo over 14 years
    Took too long to run. See the seconds behind. So I only ran it once and never again.
  • Tor Valamo
    Tor Valamo over 14 years
    I'll post my python 3 test code which uses unicode chars and a HUGE dictionary. It puts the solutions under extreme workloads (at least on 10.000 runs :P). But you could come up with better dictionaries too, such as variable length, though that would void most of the solutions, except a few.
  • Andrew Y
    Andrew Y over 14 years
    @Tor: looking forward :) @gnibbler: seems like getting rid of the concat does not make much difference between our scenarios, which is interesting. I think the difference between yours and mine is because of the map/lambda overhead in mine ? (else they are equivalent, it seems).
  • Peter Hansen
    Peter Hansen over 14 years
    @Chris, I consider lambda mostly a way of obfuscating code by encouraging one to pack more of it onto one line, where a nice separate named function is more readable. To each his own. I'm sure we'd all package this whole thing as its own function in real code though.
  • Peter Hansen
    Peter Hansen over 14 years
    I believe this fails in the case of double ampsersands.
  • Peter Hansen
    Peter Hansen over 14 years
    @Andrew Y, only now did I noticed your comment about "I doublechecked". I believe you're right, Tor's test code was not using representative inputs. My updated answer includes better inputs, but your code fails on it, probably because it doesn't handle ampsersand sequences that are not in the dictionary.
  • Peter Hansen
    Peter Hansen over 14 years
    Paul, at least it works on real input (verified using the test code in my updated answer), where some others do not. Unfortunately it is very slow compared to the others: it takes 282 times longer than the re.sub solution.
  • Peter Hansen
    Peter Hansen over 14 years
    @Michael, the actual reason it isn't as scalable is merely because the string replacement is a loop in pure C, while the dictionary loop is a loop in Python. Where performance matters, in Python, you generally don't want to do lots of Python operations inside Python loops.
  • Mike Trpcic
    Mike Trpcic over 14 years
    I've been trying to get your solution working (using the callback that calls dict.get(m.group(), '??'), but I keep receiving the following error: "exceptions.TypeError: descriptor 'get' requires a 'dict' object but received a 'str'". Any suggestions as to why? Platform is Python 2.6.
  • Peter Hansen
    Peter Hansen over 14 years
    I'm on Python 2.6 as well so it will work. Sounds like you're using the variable name "dict" instead of "mydict" or "d" or whatever. In your case, it's the dictionary class from the builtins, not a name you've assigned something to. If you make sure not to shadow builtin names (remove any use of "dict", "str" etc as variable names) your problems should vanish. Otherwise please post code and we can fix it.
  • Peter Hansen
    Peter Hansen over 14 years
    Also, you should probably use the technique in my updated answer labelled "Peter 2" but including the earlier part where I build "charset" dynamically from the dictionary. That way you can avoid the get() call and the '??' argument entirely, making it faster and simpler yet still robust.
  • Peter Hansen
    Peter Hansen over 14 years
    @Mike, the specific error would come because you're mixing two naming conventions. Your original code used "dict" for the variable name, so that's what my code sample used. However the "dict" in that "dict.get" must be whatever the dictionary variable is named. You've got it named something else now (mydict?) so the code isn't working at the moment. In the updated test code, the only use of "dict" is the builtin name, so try using the code from there and make sure your original variables are not "str" and "dict" any more.
  • John La Rooy
    John La Rooy over 14 years
    @Peter, Thanks, I added a fix for double &
  • Peter Hansen
    Peter Hansen over 14 years
    @gnibbler, with an adjustment to my test code to pre-build your version of the dictionary (no "&" in the keys) this comes out just barely slower than Claudiu's solution, roughly twice as fast as Tor's.
  • Andrew Y
    Andrew Y over 14 years
    @Peter: yes, it throws the exception for the sequences that are not in the dictionary. I wrote in the beginning that the standalone ampersands would need to be escaped before running.