Why is Boost's matrix multiplication slower than mine?

Solution 1

The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.

Here's the time taken by the uBLAS version with debugging on:

real    0m19.966s
user    0m19.809s
sys     0m0.112s

Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

real    0m7.061s
user    0m6.936s
sys     0m0.096s

So with debugging off, the uBLAS version is almost 3 times faster.
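
For reference, the two defines are simply appended to the compile command; with a g++ invocation like the one in the question, that looks like:

g++ -std=c++98 -Wall -pedantic -O3 -DNDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out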

The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS":

An important design goal of ublas is to be as general as possible.

This generality almost always comes at a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods (times in seconds):

ijkalgorithm   prod   axpy_prod  block_prod
   1.335       7.061    1.330       1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and a careful choice of the block size for block_prod (I used 64) can make the difference more pronounced.
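
For completeness, here is a minimal sketch (not the actual benchmark code) of how the two dense-oriented products can be called. The int element type is an arbitrary choice, and the block size of 64 is the one used in the measurements above:

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>          // axpy_prod
#include <boost/numeric/ublas/operation_blocked.hpp>  // block_prod

using namespace boost::numeric::ublas;

void multiply_axpy(const matrix<int>& A, const matrix<int>& B, matrix<int>& C) {
    C.resize(A.size1(), B.size2(), false);  // the result must have the right shape
    axpy_prod(A, B, C, true);               // 'true' clears C first, so C = A * B
}

void multiply_blocked(const matrix<int>& A, const matrix<int>& B, matrix<int>& C) {
    // the block size is a template argument; 64 is the value used in the timings above
    C = block_prod<matrix<int>, 64>(A, B);
}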

See also uBLAS FAQ and Effective uBlas and general code optimization.

Solution 2

I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy optimization. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices; it gives me

10.064s for uBLAS

7.851s for vector

The difference is still there, but it is by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates its result only when needed; e.g. prod(A, B)(10, 100) executes in no time, since only that one element is actually calculated. Consequently there's no dedicated algorithm for whole-matrix multiplication that could be optimized (but see below). You can help the library a little, though: declaring

matrix<int, column_major> B;

will reduce the running time to 4.426 s, which beats your function with one hand tied behind its back. This declaration makes memory access more sequential when the matrices are multiplied, improving cache usage.
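
As a small sketch of what that change looks like in context (the 1000x1000 sizes merely mirror the benchmark; everything else stays as before):

#include <boost/numeric/ublas/matrix.hpp>

using namespace boost::numeric::ublas;

int main() {
    matrix<int> A(1000, 1000);                // row_major storage by default
    matrix<int, column_major> B(1000, 1000);  // columns of B are now contiguous in memory
    matrix<int> C = prod(A, B);               // same call as before, with cache-friendlier access to B
}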

P.S. Had you read the uBLAS documentation to the end ;), you would have found that there are actually dedicated functions to multiply whole matrices in one go: axpy_prod and opb_prod. So

opb_prod(A, B, C, true);

even on an unoptimized row_major B matrix, executes in 8.091 s and is on par with your vector algorithm.
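
Here is a minimal self-contained sketch of that call (assuming opb_prod is available from the same operation header as axpy_prod, and sizing the result up front):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>  // axpy_prod and (assumed) opb_prod

using namespace boost::numeric::ublas;

int main() {
    matrix<int> A(1000, 1000), B(1000, 1000);
    matrix<int> C(A.size1(), B.size2());  // result sized up front

    // 'true' clears C first, i.e. computes C = A * B rather than C += A * B
    opb_prod(A, B, C, true);
}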

P.P.S. There are even more optimizations:

C = block_prod<matrix<int>, 1024>(A, B);

executes in 4.4 s, no matter whether B is column_major or row_major. Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!

Solution 3

I created a little website, Matrix-Matrix Product Experiments with uBLAS. It's about integrating a new implementation of the matrix-matrix product into uBLAS. If you already have the Boost library, it consists of only 4 additional files, so it is pretty much self-contained.

I would be interested if others could run the simple benchmarks on different machines.

Author: Martin Thoma

I also have a blog about Code, the Web and Cyberculture (medium as well) and a career profile on Stackoverflow. My interests are mainly machine-learning, neural-networks, data-analysis, python, and in general backend development. I love building secure and maintainable systems.

Updated on January 23, 2020

Comments

  • Martin Thoma, over 4 years ago

    I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working boost code)

    Result result = read ();
    
    boost::numeric::ublas::matrix<int> C;
    C = boost::numeric::ublas::prod(result.A, result.B);
    

    and another one with the standard algorithm (see full standard code):

    vector< vector<int> > ijkalgorithm(vector< vector<int> > A, 
                                        vector< vector<int> > B) {
        int n = A.size();
    
        // initialise C with 0s
        vector<int> tmp(n, 0);
        vector< vector<int> > C(n, tmp);
    
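        // i-k-j loop order: the innermost j-loop walks B[k][j] and C[i][j] sequentially, which is cache-friendly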
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
        return C;
    }
    

    This is how I test the speed:

    time boostImplementation.out > boostResult.txt
    diff boostResult.txt correctResult.txt
    
    time simpleImplementation.out > simpleResult.txt
    diff simpleResult.txt correctResult.txt
    

    Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:

    g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic
    

    I got 15 seconds for my implementation and over 4 minutes for the boost-implementation!

    edit: After compiling it with

    g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out
    

    I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.

    Why is boost so much slower than my implementation?

    • Mysticial, almost 12 years ago
      The only time reinventing the wheel is a good idea is when you can make a better wheel...
    • ildjarn, almost 12 years ago
      Boost.uBLAS is meant to be a standard interface, not a robust implementation, so do not expect it to be fast unless you're using e.g. the LAPACK back-end.
    • TJD, almost 12 years ago
      Boost uBLAS has some optional debug checking that will slow things down. See this FAQ boost.org/doc/libs/1_49_0/libs/numeric/ublas/doc/index.htm, and check preprocessor macros BOOST_UBLAS_NDEBUG and NDEBUG
    • sarnold, almost 12 years ago
      You ran the boost test first here -- was the data already in the buffer cache or was that the first execution? Or, how many times did you run both tests?
    • Mysticial, almost 12 years ago
      Though it shouldn't take 4 minutes to read a couple of 2k-by-2k matrices.
    • Jesse Good, almost 12 years ago
      Could you provide the text file you used?
    • stinky472, almost 12 years ago
      @Mysticial tricky part is that most people reinventing the wheel are generally convinced that it's better regardless of whether it actually is or not, or else they probably wouldn't be doing it in the first place. :-D
    • Crashworks, almost 12 years ago
      Why not profile your Boost implementation and see where it's spending its time?
    • mfontanini, almost 12 years ago
      You should measure the time taken only by the matrix multiplication. That way you'd be avoiding measuring IO operations, which take lots of time.
    • user541686, almost 12 years ago
      @Mysticial: Well, even if it's not better, it might still be worth it if it takes less time than actually finding a wheel... :-)
    • dyp, almost 12 years ago
      Could you try out another compiler? On Windows, I get approx 10/1 ratio (time consumption ublas)/(time consumption your STL version) on g++ and 1/2 ratio on MSVC (with boost being faster!) when timing both methods (w/o IO) inside the same program. Which one is performed first seems to have no impact.
    • Martin Thoma, almost 12 years ago
      @JesseGood everything is in this GIT-Repository: github.com/MartinThoma/matrix-multiplication If you like to read a little bit more about why I tried it: I've made a blogpost about the topic martin-thoma.com/matrix-multiplication-python-java-cpp
  • mfontanini, almost 12 years ago
    Could you run the same test using the version OP provided?
  • vitaut, almost 12 years ago
    @mfontanini: Sure, I updated the answer. Note that I used smaller (1000x1000) random matrices so all the times are smaller.
  • dyp, almost 12 years ago
    As I already commented, on my machine/compiler combination (VS 9), fully optimizing, the OP's boost version actually runs faster than the vector version when only timing the computation (no IO). From the disassembly, I'd guess the vector version could have been inlined/streamlined better by gcc, with loop unrolling etc. On the other hand, vector<vector> needs several allocations (optimizations possible?), whereas boost can use one allocation for the whole matrix.
  • vitaut, almost 12 years ago
    Both times look large to me, on my machine vector version takes just 1.3 seconds for 1000x1000 random matrix. What machine are you testing on?
  • panda-34, almost 12 years ago
    @vitaut, It's a Pentium M 1600 notebook :)
  • dyp, almost 12 years ago
    +1 From my measurements I can now conclude: using iterators instead of integer indices + element access is 3x faster on VC9 but not faster on g++ (4.6.2). The ikj-algorithm applied to boost matrices is slower than vector+iterators (6x on MSVC and 10x on gcc), no matter whether iterators are used or not. block_prod is as fast as vector+iterators with both compilers. The timing was done roughly and excludes both IO and allocations.
  • Bryan Fok, over 6 years ago
    The above link is broken.