Speedup GPU vs CPU for matrix operations

Solution 1

Matrix multiplication performance

If you use numpy, you are probably using one of the BLAS libraries as the computational backend, such as ATLAS, OpenBLAS, or MKL. If you are using the fastest one, MKL, you can find a recent performance comparison here, between a recent Nvidia K40m GPU and a 12-core Intel Xeon E5-2697 v2 @ 2.70 GHz:

https://developer.nvidia.com/cublas

where the K40m is about 6x faster than the 12-thread E5-2697. Since MKL scales well on multi-core CPUs, that makes the K40m roughly 72x faster than a single-threaded E5-2697. Please also note that 1000x1000 is roughly the lower bound for fully utilising either the GPU or the CPU; smaller matrix sizes usually degrade GPU performance more.

If you are using a slower BLAS backend for numpy, say the open-source ATLAS, you can find a comparison between MKL and ATLAS here:

https://software.intel.com/en-us/intel-mkl/benchmarks#DGEMM-ATLAS

where MKL is 2~4x faster than ATLAS.

For Nvidia GPUs, the only widely used backend is CUDA's cuBLAS, so performance doesn't vary between libraries the way it does between ATLAS and MKL.
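
As a side note, if you are not sure which BLAS backend your numpy build actually links against, you can check it directly; np.show_config() is standard numpy:

    import numpy as np

    # Print the BLAS/LAPACK libraries this numpy build is linked against
    # (look for entries mentioning MKL, OpenBLAS, or ATLAS).
    np.show_config()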

Data transfer

As @janbrohl says, data transfer between host RAM and GPU device memory is an important factor that affects the overall performance. Here's a benchmark of the data transfer speed:

CUDA - how much slower is transferring over PCI-E?

Given the matrix size, you can actually estimate the absolute time for computation and for data transfer separately. That can help you evaluate the performance better.
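
For example, here is a rough back-of-envelope estimate for the 1000x1000 case from the question; the sustained-GFLOPS and PCI-E bandwidth numbers are illustrative assumptions, not measurements:

    # Rough estimate of compute vs. transfer time for an n x n matrix multiply.
    n = 1000

    flops = 2 * n**3                  # multiply-adds for C = A @ B
    bytes_moved = 3 * n * n * 8       # A and B to the device, C back, float64

    gpu_sustained_gflops = 1000.0     # assumed sustained DGEMM rate (GFLOP/s)
    pcie_bandwidth_gbs = 6.0          # assumed effective PCI-E bandwidth (GB/s)

    compute_time = flops / (gpu_sustained_gflops * 1e9)
    transfer_time = bytes_moved / (pcie_bandwidth_gbs * 1e9)

    print(f"compute : {compute_time * 1e3:.2f} ms")
    print(f"transfer: {transfer_time * 1e3:.2f} ms")

With these assumed numbers the transfer takes longer than the multiplication itself, which is why the transfer cost matters so much at this matrix size.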

To maximise GPU performance, you probably need to re-design your program to minimise data transfer, e.g. by moving all of the computation onto the GPU rather than just the matrix multiplication.
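
As an illustration only (the answer does not prescribe a particular library), here is a sketch using CuPy, assuming a CUDA-capable GPU with CuPy installed, which transfers the matrices once and keeps every loop iteration on the device:

    import numpy as np
    import cupy as cp  # assumes a CUDA GPU and CuPy are available

    m_size, sim_length = 1000, 50

    a = cp.asarray(np.random.rand(m_size, m_size))  # host -> device, once
    b = cp.asarray(np.random.rand(m_size, m_size))

    for j in range(sim_length):
        result = a @ b                # runs on the GPU, no transfer per iteration

    result_host = cp.asnumpy(result)  # device -> host, once at the end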

Solution 2

Generally speaking, GPUs are much faster than CPUs at highly parallel, simple tasks (that is what they are made for), like multiplying big matrices, but there are some problems that come with GPU computation:

  • transferring data between normal RAM and graphics RAM takes time
  • loading/starting GPU programs takes some time

So while the multiplication itself may be 100 (or more) times faster, you might see a much smaller overall speedup, or even a slowdown.

There are more issues with GPUs being "stupid" in comparison to CPUs, such as massive slowdowns on branching code and having to handle caching by hand, which can make writing fast programs for GPUs quite challenging.

Solution 3

Using the OpenCL API, I tried an 8k x 8k by 8k x 8k multiplication on a 1280-core HD 7870 (not even a mainstream desktop-grade GPU), and it took about 0.99 seconds. That is about 540 billion additions and 540 billion multiplications, which works out to roughly 1.1 TFLOPS (about 40% of the peak value quoted in its advertisements). High-end desktop-grade CPUs reach only 0.2-0.3 TFLOPS (peak), excluding their integrated GPUs. So the best CPUs cannot even match a low-to-mid-range GPU in performance, performance per watt, or performance per dollar.
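
The arithmetic behind that figure (n = 8192, so roughly 2 * n^3 operations in 0.99 s) can be checked directly:

    # Verify the operation count and throughput quoted above.
    n = 8192                 # "8k x 8k" matrices
    ops = 2 * n**3           # n^3 multiplications plus n^3 additions
    seconds = 0.99

    print(f"{ops / 2:.2e} multiplications and {ops / 2:.2e} additions")
    print(f"{ops / seconds / 1e12:.2f} TFLOPS")   # ~1.11 TFLOPS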

Key options for performance:

  • tiled calculations such as 32x32 or 48x48 patches (each compute unit runs a group of threads, with each thread computing part of a tile or the sum over all tiles of a column/row); see the sketch after this list
  • asymptotically faster algorithms such as Strassen's algorithm
  • pipelining read, write, and compute operations so that consecutive iterations overlap usefully
  • optimizing for hardware differences
  • using a library that implements the options above
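
To illustrate the first point, here is a minimal sketch of blocked (tiled) matrix multiplication in plain numpy; a real GPU kernel would map each output tile to a work-group and each element to a thread, but the blocking idea is the same:

    import numpy as np

    def blocked_matmul(a, b, tile=48):
        """Multiply a @ b by accumulating tile x tile blocks.

        Mirrors how a GPU kernel assigns one output tile per work-group;
        here the tiles are just numpy slices.
        """
        n, k = a.shape
        k2, m = b.shape
        assert k == k2, "inner dimensions must match"
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                for p in range(0, k, tile):
                    c[i:i + tile, j:j + tile] += (
                        a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                    )
        return c

    a = np.random.rand(240, 240)
    b = np.random.rand(240, 240)
    assert np.allclose(blocked_matmul(a, b), a @ b)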


Comments

  • physicsGuy (almost 2 years ago)

    I am wondering how much GPU computing would help me speed up my simulations.

    The critical part of my code is matrix multiplication. Basically, the code looks like the following Python code, with matrices of order 1000 and long for loops.

    import numpy as np
    m_size = 1000
    sim_length = 50
    
    a = np.random.rand(m_size, m_size)
    b = np.random.rand(m_size, m_size)
    
    for j in range(sim_length):
        result = np.dot(a,b)
    

    Note: My matrices are dense and mostly random, and the for loops are compiled with Cython.

    My naive guess would be that I have two factors:

    • More parallel threads (Currently of order 1 thread, GPUs of order 100 threads?) --> Speedup of order 100? [Source is quite outdated, from 2011]
    • Lower processor frequency (currently 3 GHz, GPUs typically 2 GHz) --> Neglect

    I expect that this viewpoint is too naive, so what am I missing?