Benchmarking (python vs. c++ using BLAS) and (numpy)


Solution 1

I've run your benchmark. There is no difference between C++ and numpy on my machine:

woltan's benchmark (chart)

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair, as there is no difference in the results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
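
A quick way to check what numpy was built against is its own configuration report. A minimal sketch (the exact sections printed depend on your numpy version and how it was built):

import numpy as np

# Prints the BLAS/LAPACK configuration numpy was compiled against.
# Look for entries like "mkl", "openblas" or "atlas"; if nothing optimized
# shows up, numpy is falling back to its slow internal routines.
np.show_config()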

Solution 2

UPDATE (30.07.2014):

I re-ran the benchmark on our new HPC. Both the hardware and the software stack changed from the setup in the original answer.

I put the results in a Google spreadsheet (it also contains the results from the original answer).

Hardware

Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
  • RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
  • RAM: 256 GB

Software

The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

  • OS: Suse
  • Intel Compiler: ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • ATLAS: 3.8.4

Dot-Product Benchmark

The benchmark code is the same as below; however, for the new machines I also ran the benchmark for the matrix sizes 5000 and 8000.
The charts below include the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib BLAS --> Nehalem Netlib BLAS, etc.).

Matrix multiplication (sizes=[1000,2000,3000,5000,8000])

Single-threaded performance (chart)

Multi-threaded (8 threads) performance (chart)

Threads vs. matrix size, Ivy Bridge MKL (chart)

Benchmark Suite

Single-threaded performance (chart)

Multi-threaded (8 threads) performance (chart)

Conclusion

The new benchmark results are similar to those in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test. The eigenvalue test only performs reasonably well with OpenBLAS in single-threaded mode; in multi-threaded mode the performance is worse.

The "Matrix size vs threads chart" also show that although MKL as well as OpenBLAS generally scale well with number of cores/threads,it depends on the size of the matrix. For small matrices adding more cores won't improve performance very much.

There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.


Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms written in Python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

  • Numpy with ATLAS
  • Numpy with GotoBlas2 (1.13)
  • Numpy with MKL (11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

I ran two different benchmarks:

  1. simple dot product of matrices with different sizes
  2. Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

  • OS: Ubuntu Lucid (10.04), 64-bit
  • CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
  • RAM: 24 GB
  • Intel Compiler: 11.1/073
  • Scipy: 0.8
  • Numpy: 1.5

Mac Book Pro (Accelerate Framework):

  • OS: Mac OS X Snow Leopard (10.6)
  • CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
  • RAM: 4 GB
  • Scipy: 0.7
  • Numpy: 1.3

Mac Server (Accelerate Framework):

  • OS: Mac OS X Snow Leopard Server (10.6)
  • CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
  • RAM: 4 GB
  • Scipy: 0.8
  • Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np

# run in an IPython session; size was set to 1000, 2000 and 3000 for the results below
size = 1000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))
%timeit np.dot(a, b)
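
(A roughly equivalent sketch using the plain timeit module, in case you want to run it outside IPython; the sizes are simply the ones from the table below:)

import timeit

for size in (1000, 2000, 3000):
    setup = ("import numpy as np; "
             "a = np.random.random_sample((%d, %d)); "
             "b = np.random.random_sample((%d, %d))" % (size, size, size, size))
    # best of 3 repetitions, one call each, reported in milliseconds
    best = min(timeit.repeat("np.dot(a, b)", setup=setup, repeat=3, number=1))
    print("size = %d: %.0f ms" % (size, best * 1000))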

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Dot product benchmark - chart

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Benchmark suite - chart

Installation

Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can easily be compiled as a shared library. However, because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn't work for some reason, so I had to create an .so file for each platform for which I wanted an optimized libgoto2.so file.

If you install numpy from Ubuntu's repository, it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (a Fortran compiler, etc.).

If you install numpy on a Mac OS X machine with Fink or MacPorts, it will configure numpy to use either ATLAS or Apple's Accelerate Framework. You can check by running ldd on the numpy.core._dotblas file or by calling numpy.show_config().

Conclusions

MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test, GotoBlas2 performs surprisingly worse than expected; I'm not sure why this is the case.
Apple's Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with the number of threads. So if you have to deal with big matrices, running the computation on multiple threads will help a lot.
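
(The thread count used by these libraries can usually be controlled through environment variables; a minimal sketch, assuming the commonly used variable names for MKL, GotoBlas2/OpenBLAS and OpenMP builds:)

import os

# Set these before importing numpy, otherwise the BLAS thread pool
# may already be initialized with the default thread count.
os.environ["MKL_NUM_THREADS"] = "8"        # Intel MKL
os.environ["GOTO_NUM_THREADS"] = "8"       # GotoBlas2
os.environ["OPENBLAS_NUM_THREADS"] = "8"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "8"        # generic OpenMP fallback

import numpy as np  # np.dot on large matrices can now use up to 8 threads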

In any case, don't use the default netlib BLAS implementation, because it is way too slow for any serious computational work.

On our cluster I also installed AMD's ACML, and performance was similar to MKL and GotoBlas2. I don't have any numbers, though.

I would personally recommend using GotoBlas2, because it's easier to install and it's free.

If you want to code in C++/C, also check out Eigen3, which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

Solution 3

Here's another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

For large matrices, I see essentially no difference between the different methods: Numpy, Ctypes and Fortran. (Fortran instead of C++ --- and if this matters, your benchmark is probably broken.)

Your CalcTime function in C++ seems to have a sign error: ... + ((double)start.tv_usec)) should instead be ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as the number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function -- it's OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.


To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can't import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower and not meant to be used for performance computing! The reply from @Woltan below indicates that this is the explanation for the difference he/she sees between Numpy and Ctypes+BLAS.
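
A minimal check along these lines (only meaningful for the old numpy versions discussed here, which still ship _dotblas):

# If this import fails, numpy.dot falls back to numpy's internal, unoptimized BLAS.
try:
    import numpy.core._dotblas
    print("accelerated dot: " + numpy.core._dotblas.__file__)
except ImportError:
    print("numpy is using its internal BLAS fallback (slow)")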

To fix the situation, you need either ATLAS or MKL --- check these instructions: http://scipy.org/Installing_SciPy/Linux. Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (the name may vary).

Solution 4

Given the rigor you've shown with your analysis, I'm surprised by the results thus far. I put this as an 'answer' but only because it's too long for a comment and does provide a possibility (though I expect you've considered it).

I would've thought the numpy/python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases the proportion that Python participates in should be small. I'm more interested in the results on the right-hand side of the graph, but an orders-of-magnitude discrepancy shown there would be disturbing.

I wonder if you're using the best algorithms that numpy can leverage. From the compilation guide for linux:

"Build FFTW (3.1.2): SciPy Versions >= 0.7 and Numpy >= 1.2: Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack. There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis. Downgrade to a Numpy/Scipy version that includes support. Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example."

Did you compile numpy with MKL? (http://software.intel.com/en-us/articles/intel-mkl/). If you're running on Linux, the instructions for compiling numpy with MKL are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of the URL). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

If you're on Windows, you can obtain a compiled binary with MKL (and also obtain pyfftw and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

A caveat: in either case there are many licensing issues and so on to be aware of, but the Intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on Linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, even without FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and an excellent question. Thanks for posting it.



Comments

  • Woltan
    Woltan almost 2 years

    I would like to write a program that makes extensive use of the BLAS and LAPACK linear algebra functionalities. Since performance is an issue, I did some benchmarking and would like to know if the approach I took is legitimate.

    I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

    1. Numpy, making use only of the functionality of dot.
    2. Python, calling the BLAS functionalities through a shared object.
    3. C++, calling the BLAS functionalities through a shared object.

    Scenario

    I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:

    m1 = numpy.random.rand(i,i).astype(numpy.float32)
    m2 = numpy.random.rand(i,i).astype(numpy.float32)
    

    1. Numpy

    The code used looks like this:

    tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
    rNumpy.append((i, tNumpy.repeat(20, 1)))
    

    2. Python, calling BLAS through a shared object

    With the function

    import ctypes
    from ctypes import byref, c_char, c_int, c_float

    _blaslib = ctypes.cdll.LoadLibrary("libblas.so")
    def Mul(m1, m2, i, r):

        no_trans = c_char("n")
        n = c_int(i)
        one = c_float(1.0)
        zero = c_float(0.0)

        _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
                r.ctypes.data_as(ctypes.c_void_p), byref(n))
    

    the test code looks like this:

    r = numpy.zeros((i,i), numpy.float32)
    tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
    rBlas.append((i, tBlas.repeat(20, 1)))
    

    3. c++, calling BLAS through a shared object

    Now, the C++ code is naturally a little longer, so I will reduce the information to a minimum.
    I load the function with

    void* handle = dlopen("libblas.so", RTLD_LAZY);
    void* Func = dlsym(handle, "sgemm_");
    

    I measure the time with gettimeofday like this:

    gettimeofday(&start, NULL);
    f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
    gettimeofday(&end, NULL);
    dTimes[j] = CalcTime(start, end);
    

    where j is a loop running 20 times. I calculate the time passed with

    double CalcTime(timeval start, timeval end)
    {
        double factor = 1000000;
        return (((double)end.tv_sec) * factor + ((double)end.tv_usec)
                - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
    }
    

    Results

    The result is shown in the plot below:

    (plot of the benchmark results)

    Questions

    1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
    2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
    3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

    Download

    The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

    • rocksportrocker
      rocksportrocker over 12 years
      In your ctypes approach, you have the memory allocation inside the measured function. Does your C++ code follow this approach? But compared to the matrix multiplication this should not make a big difference...
    • Woltan
      Woltan over 12 years
      @rocksportrocker You are correct. The memory allocation for the r matrix is unfair. I am resolving the "issue" right now and will post the new results.
    • jfs
      jfs over 12 years
      1. Make sure that the arrays have the same memory layout, e.g., via np.ascontiguousarray() (consider C vs. Fortran order). 2. Make sure that np.dot() uses the same libblas.so.
    • Woltan
      Woltan over 12 years
      @J.F.Sebastian Both arrays m1 and m2 have the contiguous flag set to True, and numpy uses the same shared object as C does. As for the order of the arrays: currently I am not interested in the result of the computation, hence the order is irrelevant.
    • jfs
      jfs over 12 years
      @Woltan: np.asfortranarray() vs. np.ascontiguousarray() doesn't change the result of the computation but it might affect performance. Though for np.dot() implementation I don't see any difference.
    • HYRY
      HYRY over 12 years
      I can't believe that the C++ version is 200x faster than numpy.dot. Maybe the C++ compiler does some unexpected optimization.
    • Woltan
      Woltan over 12 years
      @HYRY Where should the optimization come from? The time it takes C++ to run the BLAS library function is pretty constant through all 20 iterations.
    • jfs
      jfs over 12 years
      @Woltan: Here's the paper that discusses an influence of memory layout on performance: Fast numerical computations with Cython
    • jfs
      jfs over 12 years
      @Woltan: don't use filefactory; the service is awful. I've added your benchmark to github: woltan-benchmark. If you use github, I could add you as a collaborator.
    • D R
      D R about 9 years
      I ran your benchmarks under Python 3 and got a similar discrepancy between Python and C++. While I understand that numpy performance depends on the BLAS library it links against, I am at a loss as to why there is a difference between the Python/BLAS and C++/BLAS approaches, as they both link to the same library. How were you able to resolve this?
    • Daniel Szulc
      Daniel Szulc almost 4 years
      Even though this thread is quite old, I would like to post my results on an i3-4150, which is quite old too :). Linux, numpy with MKL. Plot returned by "make".
    • AlphaF20
      AlphaF20 over 3 years
      Sorry for commenting on this old post. Is there any explanation so far for why numpy-BLAS is at least one order of magnitude slower than C++?
  • Woltan
    Woltan over 12 years
    Thank you for your elaborate "comment"^^. To clarify my python/numpy/BLAS setup: I followed this installation guide. I am on a Linux OS and the versions are: Python 2.7, Scipy 0.9, Numpy 1.6. Unfortunately I did not build FFTW beforehand, nor did I use MKL...
  • Profane
    Profane over 12 years
    In a way, it is fortunate. It means there is huge room for improvement in the python results and it sounds like you'd like to use python. I think if you amend your build to the one shown on the link you will be much happier with numpy's speed, though I'd still be fascinated to see how it compared to your C++ implementation.
  • Profane
    Profane over 12 years
    You could try to build ATLAS too, but that sounded like too many headaches for my performance needs, so I haven't any experience. I imagine if you're interested in using python but able to use C++, there'd be some point at which the setup cost of doing lots of special compilation would outweigh the language savings, and it'd be easier to do c++. But mkl and fftw should both be pretty straightforward.
  • jfs
    jfs over 12 years
    I've run your benchmark; the results are the same
  • Woltan
    Woltan over 12 years
    Thank you very much for your posting. I ran your benchmark with this result, so I cannot reproduce yours. To check which BLAS my numpy is using: I cannot import numpy.core._dotblas. What could be the problem here? I will try to clean up my benchmark and write a makefile in order for others to test it.
  • pv.
    pv. over 12 years
    @Woltan: the fact that you can't import numpy.core._dotblas means that your Numpy is using its internal fallback copy of BLAS (slower, and not meant to be used in performance computing!), rather than the BLAS library you have on your system. This explains the results you got from the benchmark. To fix the situation, you need to install a BLAS version Numpy can work with --- which means ATLAS or MKL. Here's a set of instructions: scipy.org/Installing_SciPy/Linux
  • jfs
    jfs over 12 years
    @pv.: Could you run Woltan's benchmark to compare results?
  • pv.
    pv. over 12 years
    @J.F.Sebastian: Sure, here. It agrees with my benchmark, and does not show a difference between the different approaches.
  • Woltan
    Woltan over 12 years
    Thank you very much for this elaborate answer!
  • Woltan
    Woltan over 12 years
    @pv. I installed ATLAS as described in the link you posted. Importing numpy.core._dotblas is now possible, but I still get a rather large discrepancy between the C++ and numpy performance. Any idea what could solve the problem?
  • pv.
    pv. over 12 years
    Your C++ performance is puzzling. I'm not able to reproduce that, nor does anyone else seem to have succeeded. The 100x difference seems impossible, so I'd double check the C++ code you have: are the matrix sizes the same, does the call produce the same results, etc. Also check against Fortran.
  • RichVel
    RichVel over 10 years
    On a Mac, you can use otool -L instead of ldd on Linux.
  • Admin
    Admin almost 10 years
    Very comprehensive, thank you! I wonder, three years later, whether OpenBLAS (as far as I know, it is a descendant of GotoBLAS) would perform better. I have read somewhere that it outperforms MKL, but can't find the source right now.
  • Admin
    Admin almost 10 years
    Thanks! This is my impression too (I was wondering whether this was just my installation): OpenBLAS does not perform so well in multi-threaded mode when it comes to diagonalizing matrices (I diagonalize in scipy, which is linked to OpenBLAS).
  • Ümit
    Ümit almost 10 years
    @William: usually you don't need to specifically link scipy to openblas because it will use the numpy config during install and actually most BLAS/Lapack calls will be forwarded to numpy anyways. So if numpy is properly linked against openblas, everything should work fine.
  • Admin
    Admin almost 10 years
    @Ümit: Thanks! I am trying to setup numpy to link against MKL now.
  • Sturla Molden
    Sturla Molden over 9 years
    Currently MKL, Accelerate and OpenBLAS are similar in performance. OpenBLAS is more scalable than MKL, though.
  • wmac
    wmac almost 9 years
    So what did the original poster do wrong? I wish he had commented on this post. Does he confirm that Numpy is as fast as C++?
  • eriophora
    eriophora almost 8 years
    Sorry I'm a bit confused. The original benchmark appeared to compare C++ performance to Numpy, however, this is only for Numpy with different linalg libraries and architectures. Am I missing something?
  • cdcdcd
    cdcdcd almost 8 years
    Your C++ code is running slower than the original poster's. Did you compile with optimisation enabled?
  • jfs
    jfs almost 8 years
    @cdcdcd it is not my code. Click the link and run the benchmark yourself with different optimization options (see the Makefile). Though the code doesn't recompile either BLAS or LAPACK.