Text processing - Python vs Perl performance

Solution 1

This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.

One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

And then in your loop:

mprev = exists_re.search(currline)

and

mcurr = location_re.search(currline)

That by itself won't magically bring your Python script in line with your Perl script, but repeatedly passing a pattern string to re.search inside a loop, without compiling it first, is bad practice in Python.
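
To see the difference, here is a quick timeit sketch (a minimal sketch; the sample log line and iteration count are made up for illustration):

import re
import timeit

# A made-up sample line in the same shape as the log lines being matched.
line = '2012-10-18 09:15:02 INFO Such a record already exists'

# Uncompiled: the pattern string must be looked up (in re's internal cache)
# on every single call.
t_uncompiled = timeit.timeit(
    lambda: re.search(r'^(.*?) INFO.*Such a record already exists', line),
    number=100000)

# Precompiled: the compiled pattern object is reused directly.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
t_compiled = timeit.timeit(lambda: exists_re.search(line), number=100000)

print t_uncompiled, t_compiled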

Solution 2

Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.

What do you get by replacing

^(.*?) INFO.*Such a record already exists

with

^((?:(?! INFO).)*?) INFO.*Such a record already exists

or

^(?>(.*?) INFO).*Such a record already exists
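
Note that Python's re module does not support the atomic-group syntax (?>...) (support only arrived in Python 3.11), so in Python only the lookahead variant can be tried. A minimal timing sketch, using a made-up line that does not match, since the interesting cost here is failed matching:

import re
import timeit

# A made-up non-matching line to exercise the backtracking path.
line = 'AwbLocation ABC123 insert into tbl_location'

plain_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
lookahead_re = re.compile(r'^((?:(?! INFO).)*?) INFO.*Such a record already exists')

print timeit.timeit(lambda: plain_re.search(line), number=100000)
print timeit.timeit(lambda: lookahead_re.search(line), number=100000)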

Solution 3

Function calls are a bit expensive in terms of time in Python, and yet you have a loop-invariant function call to get the file name inside the loop:

fn = fileinput.filename()

Move this line above the for loop and you should see some improvement to your Python timing. Probably not enough to beat out Perl though.
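
As pepr points out in the comments, the filename is not strictly loop invariant, since it changes from file to file. A safer restructuring drops fileinput and hoists the filename into an explicit outer loop. A minimal sketch, assuming the shell (as under Cygwin) has already expanded the *log* masks into sys.argv:

#!/usr/bin/python

import re
import sys

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fn in sys.argv[1:]:        # fn is now fixed for the whole inner loop
    f = open(fn)
    for line in f:
        mprev = exists_re.search(line)
        if mprev:
            xlogtime = mprev.group(1)
        mcurr = location_re.search(line)
        if mcurr:
            print fn, xlogtime, mcurr.group(1)
    f.close()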

Solution 4

In general, all artificial benchmarks are evil. However, everything else being equal (algorithmic approach), you can make relative improvements. Note that I don't use Perl, so I can't argue in its favor. That said, with Python you can try using Pyrex or Cython to improve performance, or, if you are adventurous, you can try converting the Python code into C++ via Shed Skin (which works for most of the core language and some, but not all, of the core modules).

In any case, you can follow some of the tips posted here:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

Solution 5

I expect Perl to be faster. Just being curious, can you try the following?

#!/usr/bin/python

import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()
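
Since the script expands the masks itself via glob, on a Unix-like shell the masks should be quoted so that the shell does not expand them first (a hypothetical invocation, reusing the question's script name):

python process_file.py '*log*' > summary.log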

Update, in reaction to "it is too complex".

Of course it looks more complex than the Perl version. Perl was built around regular expressions; you will hardly find an interpreted language that is faster at them. The Perl syntax...

while (<>) {
    ...
}

... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:        # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)

            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)

Here the input_files() generator could be placed elsewhere (say, in another module), or it can be reused. It is even possible to mimic Perl's while (<>) {...} easily, though not with the same syntax:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f: # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)

    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)

Then the last for loop can look (in principle) as easy as Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.
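
For completeness: the standard fileinput module (which the question's code already uses) is the closest built-in analogue of Perl's while (<>) {...}, though it does not expand glob masks itself. A minimal sketch:

#!/usr/bin/python

import fileinput

# fileinput.input() chains together the lines of every file named in
# sys.argv[1:], much like Perl's while (<>) { ... }.
for line in fileinput.input():
    print fileinput.filename(), line.rstrip()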

Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But, in my opinion, Python is a better programming language for more general purposes.

Comments

  • ihightower
    ihightower almost 2 years

    Here are my Perl and Python scripts for some simple text processing of about 21 log files, each about 300 KB to 1 MB (maximum), repeated 5 times (a total of 125 files, due to the log being repeated 5 times).

    Python Code (code modified to use compiled re and using re.I)

    #!/usr/bin/python
    
    import re
    import fileinput
    
    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
    location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)
    
    for line in fileinput.input():
        fn = fileinput.filename()
        currline = line.rstrip()
    
        mprev = exists_re.search(currline)
    
        if(mprev):
            xlogtime = mprev.group(1)
    
        mcurr = location_re.search(currline)
    
        if(mcurr):
            print fn, xlogtime, mcurr.group(1)
    

    Perl Code

    #!/usr/bin/perl
    
    while (<>) {
        chomp;
    
        if (m/^(.*?) INFO.*Such a record already exists/i) {
            $xlogtime = $1;
        }
    
        if (m/^AwbLocation (.*?) insert into/i) {
            print "$ARGV $xlogtime $1\n";
        }
    }
    

    On my PC, both scripts generate exactly the same result file of 10,790 lines. Here is the timing, done on Cygwin's Perl and Python implementations.

    User@UserHP /cygdrive/d/tmp/Clipboard
    # time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* >
    summarypy.log
    
    real    0m8.185s
    user    0m8.018s
    sys     0m0.092s
    
    User@UserHP /cygdrive/d/tmp/Clipboard
    # time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* >
    summarypl.log
    
    real    0m1.481s
    user    0m1.294s
    sys     0m0.124s
    

    Originally, it took 10.2 seconds using Python and only 1.9 seconds using Perl for this simple text processing.

    (UPDATE) But after the compiled re version of Python, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

    Is there a way to improve the speed of Python at all, OR is it obvious that Perl will be the speedy one for simple text processing?

    By the way, this was not the only test I did for simple text processing... Each different way I write the source code, Perl always wins by a large margin, and not once did Python perform better for simple m/regex/ match-and-print stuff.

    Please do not suggest to use C, C++, Assembly, other flavours of Python, etc.

    I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using its modules). Boy, do I wish to use Python for all my tasks due to its readability, but give up speed? I don't think so.

    So, please suggest how the code can be improved to achieve results comparable with Perl's.

    UPDATE: 2012-10-18

    As other users suggested, Perl has its place and Python has its.

    So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files, writing the results to a file (or printing to screen), Perl will always, always WIN in performance for this job. It is as simple as that.

    Please note that when I say Perl wins in performance... only standard Perl and Python are compared... not resorting to some obscure modules (obscure for a normal user like me) and also not calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all these extra steps and installations for a simple text-matching job.

    So, Perl rocks for text processing and regex.

    Python has its place to rock in other places.

    Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching... For more details, read the article.

    • ikegami
      ikegami over 11 years
      Are the patterns only compiled once in Python (as they are in Perl)?
    • ikegami
      ikegami over 11 years
      Are the two programs equivalent? I don't see anything like /i in the Python version.
    • nneonneo
      nneonneo over 11 years
      They're not totally equivalent ((?i) or re.I should be added for Python), but very close.
    • ikegami
      ikegami over 11 years
      I wonder if the difference is in the time spent backtracking in lines that don't match.
    • ihightower
      ihightower over 11 years
      I have edited the code to compile the re and use re.I. I benchmarked again and have updated the results in my question.
    • pepr
      pepr over 11 years
      It would also be good to know the versions of Perl and Python (the x in 2.x). The line.rstrip() is not necessary.
    • pepr
      pepr over 11 years
      @ihightower: What are the exact arguments passed to the script? Are they really *log* *log* *log* *log* *log*? If so, are you sure that Perl does not extract only unique filenames? (Thus actually processing fewer files...)
    • ikegami
      ikegami over 11 years
      @pepr, Perl will process all files.
    • ikegami
      ikegami over 11 years
      You can probably speed both versions up a tiny bit by using /s.
    • Schwern
      Schwern over 11 years
      I'd run the Python code through a profiler to discover where it's spending its time. You might also try using PCRE (Perl Compatible Regular Expressions) rather than Python's built-in regexes (here's another implementation) and see if that does better.
    • pepr
      pepr over 11 years
      "Closed as too localized" seems too funny and subjective to me.
    • Leon Timmermans
      Leon Timmermans over 11 years
      I've seen benchmarks before that suggest Perl's regexp implementation is just that much faster than Python's. Otherwise they should be of comparable speed.
    • Dane White
      Dane White over 10 years
      Repeated function lookup in Python can take a surprising amount of time during long loops. So you should use exists_re = re.compile(...).search, and then call exists_re(currline) in your loop (and something similar for location_re). You should also move fn = fileinput.filename() outside the loop that does your line iterations, although to do that, you'd probably want to stop using fileinput. Since both of your regexes match at the line start, you might also try switching to re.match instead of re.search.
    • nawfal
      nawfal almost 10 years
      In case somebody wants to see, some results here
    • PYPL
      PYPL about 9 years
      you should check it once more since python has now updated to 2.8.9 from 2.4.4
    • coder.in.me
      coder.in.me almost 9 years
      I read that re2 implemented by Google is better. I tried it but no improvement: re (4.5 sec), re2 (4.5 sec), perl (0.8 sec)
    • TheAmigo
      TheAmigo over 8 years
      You could probably speed up both versions by using elsif... Unless both of those regexes can match the same line.
    • Davide Brunato
      Davide Brunato about 8 years
      In Python, when you perform a pattern match anchored at the start of the string (r"^..."), don't use the pattern.search() method; use the pattern.match() method instead, which is a bit faster.
  • nneonneo
    nneonneo over 11 years
    re caches recently-used regexes, so this is probably not a huge issue.
  • Admin
    Admin over 11 years
    @nneonneo I've heard that numerous times, and I've seen the lines in the re source code which do the caching. But somehow I've never seen a benchmark that puts the two in the same order of magnitude; I have, however, seen several benchmarks (including a quick-and-dirty one I did a second ago) that put the pre-compiling option at several times faster.
  • nneonneo
    nneonneo over 11 years
    Interesting. Well, it's definitely good practice to precompile regexes, but I didn't really pay attention to the performance gap. Care to share the numbers?
  • ihightower
    ihightower over 11 years
    I am neither an expert Perl nor Python programmer. I use Perl and Python the way I learned them from an ordinary beginner-to-intermediate-level book. If I really cared about raw performance, I would certainly use your suggestions and even assembly (if I ever learn it). Using what is readily available within Perl or Python and their modules should be the only kind of suggestion for improving the code's performance. I don't expect to chase some other magic buzzwords and spend the time to learn the rest. Please suggest pure solutions that exist within the normal Python installation.
  • ihightower
    ihightower over 11 years
    I understand all artificial benchmarks could be evil. But this text processing is simple, and it is what I do day in, day out. So, if Python cannot improve the speed using some basic syntax within the original Python installation (just as I do with Perl)... I will have to resort to Perl for my text processing tasks, to process the 100s or 100,000s of files that I have to handle. One will have to admit that Python is slow for simple text processing as given in my code. But boy, do I wish to use Python for its clean syntax; with the lack of speed, though... I don't think so.
  • pepr
    pepr over 11 years
    Regular expressions in Python are supplied via a module. Regular expressions in Perl have built-in syntax and can be compiled inline (no function-call overhead). Text processing need not be that simple. Anyway, use the better tool for each task. My personal experience is that even slightly more complex Perl programs are much more difficult to read and maintain in the future.
  • pepr
    pepr over 11 years
    +1 for the good eye, but... well, the filename changes; it is not a loop invariant. Anyway, it may be faster not to use the fileinput module and to add another, outer loop through the filenames. Then the filename would be the invariant.
  • dan1111
    dan1111 over 11 years
    -1. What is "evil" about this? It is a simple exercise that illustrates a significant performance difference between the two languages. How exactly are you supposed to compare the performance of two tools if not with a test like this? Write your entire program in both languages so that it is not "artificial"? Sure, there are pitfalls to benchmarking, but you have generalized that to a very dumb rule.
  • dan1111
    dan1111 over 11 years
    An interesting point, but this has to be minuscule compared to the processing time of the two regexes.
  • Craig Ringer
    Craig Ringer over 11 years
    @ihightower Please post your attempted edit as a new answer instead.
  • ihightower
    ihightower over 11 years
    @pepr I have posted my results as a separate answer. Now the code runs in 6.1 secs (a 2-sec improvement from earlier) compared to Perl's 1.8 secs. Please read my answer for more info.
  • pepr
    pepr over 11 years
    @ihightower: Using the with construct, it would be one line shorter. It is true that the nested for loops look terrible. However, they say exactly what is done: 1) get the command-line arguments, 2) expand each argument as a glob mask, 3) if it is a file name, open it and process its lines.
  • ihightower
    ihightower over 10 years
    As text processing is sooo universal, why won't Python just make a built-in standard module so generic that it can be applied to almost all cases? It could then improve performance for normal users, i.e. the vast majority of people... e.g. import TextTool or something, with some standard stuff that would improve the performance of text processing.