Why is Python so slow for a simple for loop?


Solution 1

Because you mention scientific code, have a look at numpy. What you're doing has probably already been done (or rather, it uses LAPACK for things like SVD). When people talk about Python being used for scientific code, they probably aren't referring to using it the way you do in your example.

As a quick example:

(If you're using Python 3, your example would use float division. My example assumes Python 2.x, and therefore integer division. If not, specify i = np.arange(9999, dtype=np.float), etc.)

import numpy as np
i = np.arange(9999)
j = np.arange(1, 9999)
print np.divide.outer(i,j).sum()
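For Python 3 and modern NumPy (where print is a function and np.float has been removed), the same computation might look like the sketch below. It uses float division, so the total differs from the integer-division example above:

```python
import numpy as np

def outer_division_sum(num):
    # Build the full (num) x (num - 1) table of i / j with ufunc.outer,
    # then sum it. Note this allocates num * (num - 1) floats at once.
    i = np.arange(num, dtype=np.float64)
    j = np.arange(1, num, dtype=np.float64)
    return np.divide.outer(i, j).sum()

print(outer_division_sum(1000))
```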

To give some idea of timing (I'll use floating-point division here, instead of the integer division in your example):

import numpy as np

def f1(num):
    total = 0.0
    for i in range(num): 
        for j in range(1, num):
            total += (float(i) / j)
    return total

def f2(num):
    i = np.arange(num, dtype=np.float)
    j = np.arange(1, num, dtype=np.float)
    return np.divide.outer(i, j).sum()

def f3(num):
    """Less memory-hungry (and faster) version of f2."""
    total = 0.0
    j = np.arange(1, num, dtype=np.float)
    for i in xrange(num):
        total += (i / j).sum()
    return total
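As an aside not in the original answer: the same 2-D table that f2 builds can also be produced with explicit broadcasting instead of np.divide.outer, which some readers find clearer:

```python
import numpy as np

def f2_broadcast(num):
    # i[:, None] has shape (num, 1) and j has shape (num - 1,), so the
    # division broadcasts to the same (num, num - 1) table that
    # np.divide.outer(i, j) produces.
    i = np.arange(num, dtype=np.float64)
    j = np.arange(1, num, dtype=np.float64)
    return (i[:, None] / j).sum()
```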

If we compare timings:

In [30]: %timeit f1(9999)
1 loops, best of 3: 27.2 s per loop

In [31]: %timeit f2(9999)
1 loops, best of 3: 1.46 s per loop

In [32]: %timeit f3(9999)
1 loops, best of 3: 915 ms per loop

Solution 2

I think NumPy can be faster than CPython for loops (I didn't test in PyPy).

I'll start from Joe Kington's code, since that answer already uses NumPy.

%timeit f3(9999)
704 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My own version:

def f4(num):
    x = np.ones(num - 1)
    y = np.arange(1, num)
    return np.sum(np.true_divide(x, y)) * np.sum(y)

155 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
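The speedup in f4 comes from factorizing the double sum: the total of i / j over all pairs equals (sum of i) times (sum of 1 / j). A quick sanity check of that identity against the brute-force nested loop (my own check, using float division):

```python
import numpy as np

def f4(num):
    x = np.ones(num - 1)
    y = np.arange(1, num)
    return np.sum(np.true_divide(x, y)) * np.sum(y)

def brute_force(num):
    return sum(i / j for i in range(num) for j in range(1, num))

# The factorized form agrees with the nested loops.
assert abs(f4(50) - brute_force(50)) < 1e-6
```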

In addition, some high-school mathematics can simplify the problem for the computer.

Problem = (1 + 2 + ... + (num-1)) * (1/1 + 1/2 + ... + 1/(num-1))
1 + 2 + ... + (num-1) = np.sum(np.arange(1, num)) = num * (num - 1) / 2
1/1 + 1/2 + ... + 1/(num-1) = np.sum(np.true_divide(1, y)) = np.sum(np.reciprocal(y.astype(np.float64)))
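Both identities are easy to verify numerically (my own sanity check, not part of the original answer):

```python
import numpy as np

num = 9999
y = np.arange(1, num)
# Triangular-number identity: 1 + 2 + ... + (num - 1) = num * (num - 1) / 2
assert np.sum(y) == num * (num - 1) // 2
# Elementwise 1 / y is the same as np.reciprocal on a float array.
assert np.allclose(np.true_divide(1, y), np.reciprocal(y.astype(np.float64)))
```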

Therefore,

def f5(num):
    return np.sum(np.reciprocal(np.arange(1, num).astype(np.float64))) * num*(num-1)/2
%timeit f5(9999)
106 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In addition, some university mathematics can simplify the problem even further.

1/1 + 1/2 + ... + 1/(num-1) ≈ np.log(num - 1) + 1/(2*num - 2) + np.euler_gamma
(for num > 2)

np.euler_gamma: Euler-Mascheroni constant (0.57721566...)

Because this approximation is not exact (the truncated series, rather than np.euler_gamma itself, is the culprit), you lose a little accuracy: 489223499.9991845 -> 489223500.0408554. If you can ignore that 0.0000000085% inaccuracy, you can save even more time.

def f6(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
%timeit f6(9999)
4.82 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
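How good is the harmonic-number approximation inside f6? A quick check (my own, not part of the original answer) against the exact sum:

```python
import numpy as np

num = 9999
exact = np.sum(1.0 / np.arange(1, num))                        # H_{num-1}
approx = np.log(num - 1) + 1 / (2 * num - 2) + np.euler_gamma
# For num this large, the truncated expansion is extremely accurate.
assert abs(approx - exact) / exact < 1e-9
```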

The benefit of NumPy grows with larger input.

%timeit f3(99999)
56.7 s ± 590 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f5(99999)
534 µs ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f5(99999999)
1.42 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.498947911958416e+16
%timeit f6(99999999)
4.88 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.498947911958506e+16  (differs from f5's result in the last three digits)
%timeit f6(9999999999999999999)
17.9 µs ± 921 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In special cases, you can use numba (though unfortunately not always).

from numba import jit
@jit
def f7(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
# same code as f6(num)

%timeit f6(999999999999999)
5.63 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
f7(123) # compile f7(num)
%timeit f7(999999999999999)
331 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f7(9999)
286 ns ± 3.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So I recommend using NumPy, mathematics, and numba together.

Solution 3

Python for loops are dynamically typed and interpreted, not compiled. Java is faster because it has JIT acceleration features that CPython does not have.

http://en.wikipedia.org/wiki/Just-in-time_compilation

To illustrate just how massive a difference Java JIT makes, look at this python program that takes about 5 minutes:

if __name__ =='__main__':
    total = 0.0
    i=1
    while i<=9999:
        j=1
        while j<=9999:
            total+=1
            j+=1
        i+=1
    print total

While this fundamentally equivalent Java program takes about 23 milliseconds:

public class Main{
    public static void main(String args[]){
        float total = 0f; 

        long start_time = System.nanoTime();
        int i=1;

        while (i<=9999){
            int j=1;
            while(j<=9999){
                total+=1;
                j+=1;
            }
            i+=1;
        }
        long end_time = System.nanoTime();

        System.out.println("total: " + total);
        System.out.println("total milliseconds: " + 
           (end_time - start_time)/1000000);
    }
}

In terms of doing anything in a for loop, Java cleans Python's clock, being several orders of magnitude faster.
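As Kr0e points out in the comments below, the while loops are also not idiomatic Python; a range-based for loop (my Python 3 adaptation, with the bound as a parameter so it runs quickly) is noticeably faster in CPython, though still far from Java:

```python
def count_iterations(n):
    # Same nested counting loop, written with for/range instead of while.
    total = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            total += 1
    return total

print(count_iterations(100))  # 10000
```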

Moral of the story: basic Python for loops should be avoided at all costs if speedy performance is required. This could be because Guido van Rossum wants to encourage people to use vectorizable constructs like array slicing, which can operate faster than Java.

Solution 4

The benefit of Python is that there is a lot more flexibility (e.g. classes are objects) compared to Java (where you only have the reflection mechanism).

What's not mentioned here is Cython. It lets you introduce typed variables and transcompile your example to C/C++, which is then much faster. I've also changed the bounds in the loop ...

from __future__ import division

cdef double total = 0.00
cdef int i, j
for i in range(9999):
    for j in range(1, 10000+i):
        total += (i / j)

from time import time
t = time()
print("total = %d" % total)
print("time = %f[s]" % (time() - t))

Followed by

$ cython loops.pyx
$ gcc -I/usr/include/python2.7 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing -O3 -o loops.so loops.c
$ python -c "import loops"

gives

total = 514219068
time = 0.000047[s]
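One caveat about the time printed above: in the .pyx module the clock is started only after the loop has already run (module-level statements execute top to bottom at import), so it measures just the print call. A pure-Python sketch of a correct measurement (my own illustration, with the bound as a parameter):

```python
from time import perf_counter

def timed_loop(n):
    start = perf_counter()              # start the clock *before* the work
    total = 0.0
    for i in range(n):
        for j in range(1, n + 1 + i):   # mirrors range(1, 10000 + i)
            total += i / j
    return total, perf_counter() - start

total, elapsed = timed_loop(200)
```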

Solution 5

You will find that list comprehensions or generator expressions are significantly faster. For example:

total = sum(i / j for j in xrange(1, 9999) for i in xrange(9999))

This executes in ~11 seconds on my machine vs. ~26 for your original code. Still an order of magnitude slower than Java, but that's more in line with what you'd expect.

Your original code can, by the way, be sped up slightly by initializing total to 0 rather than 0.0, so you use integer rather than floating-point addition. Your divisions all have integer results, so there is no point in accumulating them into a float.
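That point can be shown directly (my sketch; in Python 3, // reproduces Python 2's integer division):

```python
def total_int(n):
    total = 0                  # int accumulator: integer addition only
    for i in range(n):
        for j in range(1, n):
            total += i // j    # integer division, as Python 2's i / j was
    return total
```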

On my machine, Psyco actually slows down the generator expressions to about the same speed as your original loop (which it does not accelerate at all).

Author: Baskaya

Updated on April 23, 2021

Comments

  • Baskaya
    Baskaya about 3 years

We are making some kNN and SVD implementations in Python while others picked Java, and our execution times are very different. I used cProfile to see where I'm making mistakes, but everything looks quite fine actually. Yes, I use numpy as well. But I would like to ask a simple question.

    total = 0.0
    for i in range(9999):         # xrange is slower according to my test,
        for j in range(1, 9999):  # but more memory-friendly.
            total += (i / j)
    print total
    

    This snippet takes 31.40s on my computer.

The Java version of this code takes 1 second or less on the same computer. Type checking is the main problem for this code, I suppose. But I have to do lots of operations like this for my project, and I think 9999*9999 is not such a big number.

I think I must be making mistakes, because I know Python is used by lots of scientific projects. But why is this code so slow, and how can I handle problems bigger than this?

    Should I use a JIT compiler such as Psyco?

    EDIT

I should also say that this loop problem is only an example. The code is not as simple as this, and it may be hard to put your improvements/code samples into practice.

Another question: can I implement lots of data mining & machine learning algorithms with numpy and scipy, if I use them correctly?

  • Baskaya
    Baskaya over 12 years
Yes, I know that, but why do so many people use it for scientific research? Just the library support? Is that enough? And how can I decrease the execution time, when even a little loop is this slow? I know Python code is like pseudocode [very readable], but lots of scientific projects use bigger loops than this. I can write lots of readable, small, clean code, but its execution time goes to infinity :)
  • Matt Fenwick
    Matt Fenwick over 12 years
    @Thorn -- some people use it for the reasons given, other people may have different reasons. Code size is very important, as it's directly related to how long the project takes to complete. Performance is not the only or most important factor to many people.
  • Baskaya
    Baskaya over 12 years
    OK. I accept and respect your reasons totally. But please tell me, is the loop above big for a scientific project?
  • Matt Fenwick
    Matt Fenwick over 12 years
    @Thorn -- it's not the size of the loop, it's the fact that it's a nested loop, so it does total += i / j (10 ** 4) * (10 ** 4) = 10 ** 8 times! Is that a lot? Maybe -- how many protein structures are deposited in the PDB? 10 ** 4? How many atoms does a protein have -- 10 ** 3 or 10 ** 4? How big is the human genome -- 10 ** 9?
  • Baskaya
    Baskaya over 12 years
Thanks. Nice, but what if I make it larger than 9999, for example 99999? Is it not so flexible?
  • Admin
    Admin over 12 years
@Thorn: It's big, though not too outlandish for bigger scientific computing projects. That's why people who do such projects identify critical loops and push them into C code. They use Python because (1) it works very well for the other 80% of their code and (2) it has neat scientific libraries that either have already solved their specific problem or are easy enough to glue together with Python to do what they need (NumPy). The result may well be faster than what they'd have written themselves (NumPy, for instance, is backed by vast amounts of hand-written and -optimized C and FORTRAN code).
  • Joe Kington
    Joe Kington over 12 years
    Make it as large as you like. Keep in mind that this particular example will generate an x by x array, though, so you'll eventually run into memory constraints. In that case, you can iterate over one axis (i.e. one loop instead of nested loops) and still use vectorized operations.
  • Baskaya
    Baskaya over 12 years
@MattFenwick Yes, it is big for related projects. But for example, if you use kNN on Netflix or 10M MovieLens data, and you want to evaluate the accuracy of your recommender system's recommendations using a precision & recall approach, you see these loops are not so big.
  • Baskaya
    Baskaya over 12 years
    I asked because Python says ValueError: array is too big.
  • Joe Kington
    Joe Kington over 12 years
    I added an example of iterating over one axis instead of both. f3 is much less memory-hungry, and should scale quite well.
  • Joe Kington
    Joe Kington over 12 years
You should have gotten a MemoryError, not a ValueError... Are you using numpy or something else? (Python has an array class that's something different altogether. People often get it and numpy.array confused.)
  • Baskaya
    Baskaya over 12 years
    I am using numpy Traceback (most recent call last): File "b.py", line 4, in <module> print np.divide.outer(i,j).sum() ValueError: array is too big.
  • n611x007
    n611x007 over 11 years
What makes using xrange in f3 faster than the numpy functions in f2? I guess the / in f3 is also a numpy operator (an __rdiv__ seems to be in effect), and .sum is from numpy, too.
  • Kr0e
    Kr0e over 10 years
Those while loops are not idiomatic Python. Using for ... in loops, this code is almost 3 times faster in CPython, and takes only a couple of seconds in PyPy.
  • Joe Kington
    Joe Kington over 10 years
    @naxa - Sorry I didn't notice your comment earlier! The reason that f2 is slower for large numbers is that it builds a num x num 2D array which eats up a lot of memory. f3 only builds up a num-length 1D array and repeatedly makes a copy of it to do the division on. Memory-allocation for a huge array is actually the bottleneck here, so f3 is faster, as it's easier to find contiguous space in memory for a small array.
  • Jim Raynor
    Jim Raynor about 10 years
I love this straightforward example of using Scipy. Is there any book with a name like "Think Scipy" or "Think Numpy" you can suggest? Using loops is the usual mistake for people who bring their coding habits from C or Java to Python. Letting Python call other (optimized) modules to do the dirty work is the pythonic way :-)
  • Joe Kington
    Joe Kington about 10 years
@JimRaynor - There may be a book along those lines, but I don't know of it. Scipy-lectures is an excellent overview of the entire ecosystem: scipy-lectures.github.io However, it's not quite what you had in mind. Another good option is to follow Jamie's, unutbu's, Sven's, or DSM's (etc.) numpy answers.
  • Harald Schilly
    Harald Schilly almost 10 years
    What Sage actually does is to code a lot in Cython. That's a Python-like language (looks like it and mixed into same source code) but is trans-compiled to a C/C++ Python extension.
  • qed
    qed over 9 years
    Are you sure it's 0.000047s? On my machine (i7 intel CPU), it's 0.42s.
  • qwerty
    qwerty over 8 years
I tested on a Raspberry Pi: total = 514219068, time = 0.000365[s]
  • anthonybell
    anthonybell about 6 years
    This is called vectorization - redefining your problem in terms of vector-vector, matrix-vector, or matrix-matrix operations and calling a linear algebra library to solve it.
  • gwideman
    gwideman over 5 years
    "between 1 and 1000 orders of magnitude faster." .. so up to 1E1000 times faster? A loop-intensive operation that previously took one second, in Java takes only 1E-1000 seconds? That sounds quite attractive.
  • Overflow289
    Overflow289 almost 5 years
    It looks like you defined t = time() right before printing time() - t. You need to start your timer before the loop. Or is there something I misunderstand about how the file is being compiled?
  • vighnesh153
    vighnesh153 over 4 years
Even Java is interpreted, by the JVM.
  • Paul Kenjora
    Paul Kenjora over 2 years
I ran 3 nested loops with a range of 300, and both the while and for loops performed equally in Python 3.8.