CUDA GPU is slower than CPU in simple numpy operation

Solution 1

Even though this example appears on Nvidia's web site to show "how to use the GPU", plain matrix addition will probably be slower on the GPU than on the CPU, primarily due to the overhead of copying the data over to the GPU.

Even simple math calculations might be slower. Heavier computations can already show the gain. I've put my results together in an article showing the speed improvement with GPU, CUDA, and numpy.

In a nutshell, the question was which is bigger:

CPU time

or

copy to GPU + GPU time + copy from GPU
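As a rough sanity check, you can fit a linear cost model t = overhead + slope·N through the two timings reported in the question for each target (the two-term model and the helper below are mine, purely illustrative). For plain addition, the fitted per-element cost of the GPU path is itself higher than the CPU's, because the timing also includes the per-element transfer, so on this hardware the GPU version never breaks even:

```python
# Fit t = overhead + slope * N through two (N, time) pairs
# taken from the timings reported in the question (illustrative model).
def fit(n1, t1, n2, t2):
    slope = (t2 - t1) / (n2 - n1)  # seconds per element
    overhead = t1 - slope * n1     # fixed cost (kernel launch, setup, ...)
    return overhead, slope

cpu_overhead, cpu_slope = fit(100_000, 0.000106, 11_500_000, 0.013704)
gpu_overhead, gpu_slope = fit(100_000, 0.118718, 11_500_000, 0.471207)

print(f"CPU: {cpu_slope:.2e} s/element, overhead {cpu_overhead:.4f} s")
print(f"GPU: {gpu_slope:.2e} s/element, overhead {gpu_overhead:.4f} s")
# The GPU path has ~0.12 s of fixed overhead, and its per-element slope
# is larger than the CPU's because the transfer is counted per element:
# for plain addition there is no N at which the GPU wins.
```

This is only a back-of-the-envelope fit from two data points, but it makes the breakdown above concrete.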

Solution 2

Probably your array is too small and the operation too simple to offset the cost of the data transfer to the GPU. Another way to see it is that you are not being fair in your timing: for the GPU you are also measuring the memory transfer time, not only the processing time.

Try a more challenging example: maybe first an element-wise multiplication of a big matrix, and then a full matrix multiplication.
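A quick CPU-only sketch (array size is my own choice, purely illustrative) of why the full matrix multiplication is the more challenging case: it performs on the order of 2·N³ floating-point operations on the same N² elements, so the work per element transferred grows with N, while the element-wise product stays at one multiply per element:

```python
import numpy as np

N = 500
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)

elementwise = A * B  # N*N multiplies on N*N elements
matmul = A @ B       # ~2*N**3 FLOPs on the same N*N elements

flops_elementwise = N * N
flops_matmul = 2 * N ** 3
# For N = 500, matmul does 2*N = 1000x more work per element moved,
# which is exactly what amortizes the transfer cost on a GPU.
print(flops_matmul // flops_elementwise)
```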

In the end, the power of the GPU is to perform many operations on the same data, so you end up paying the data-transfer cost only once.


Updated on June 11, 2022

Comments

  • szabgab
    szabgab almost 2 years

I am using this code, based on this article, to see the GPU acceleration, but all I can see is a slowdown:

    import numpy as np
    from timeit import default_timer as timer
    from numba import vectorize
    import sys
    
    if len(sys.argv) != 3:
        exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")
    
    
    @vectorize(["float32(float32, float32)"], target=sys.argv[1])
    def VectorAdd(a, b):
        return a + b
    
    def main():
        N = int(sys.argv[2])
        A = np.ones(N, dtype=np.float32)
        B = np.ones(N, dtype=np.float32)
    
        start = timer()
        C = VectorAdd(A, B)
        elapsed_time = timer() - start
        #print("C[:5] = " + str(C[:5]))
        #print("C[-5:] = " + str(C[-5:]))
        print("Time: {}".format(elapsed_time))
    
    main()
    

    The results:

    $ python speed.py cpu 100000
    Time: 0.0001056949986377731
    $ python speed.py cuda 100000
    Time: 0.11871792199963238
    
    $ python speed.py cpu 11500000
    Time: 0.013704434997634962
    $ python speed.py cuda 11500000
    Time: 0.47120747699955245
    

I cannot use a bigger vector, as that generates a `numba.cuda.cudadrv.driver.CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE` exception.

    The output of nvidia-smi is

    Fri Dec  8 10:36:19 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro 2000D        Off  | 00000000:01:00.0  On |                  N/A |
    | 30%   36C   P12    N/A /  N/A |    184MiB /   959MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0       933      G   /usr/lib/xorg/Xorg                            94MiB |
    |    0       985      G   /usr/bin/gnome-shell                          86MiB |
    +-----------------------------------------------------------------------------+
    

    Details of the CPU

    $ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              4
    On-line CPU(s) list: 0-3
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               58
    Model name:          Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
    Stepping:            9
    CPU MHz:             3300.135
    CPU max MHz:         3700.0000
    CPU min MHz:         1600.0000
    BogoMIPS:            6600.27
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            256K
    L3 cache:            6144K
    NUMA node0 CPU(s):   0-3
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
    

The GPU is an Nvidia Quadro 2000D with 192 CUDA cores and 1 GB RAM.

    More complex operation:

    import numpy as np
    from timeit import default_timer as timer
    from numba import vectorize
    import sys
    
    if len(sys.argv) != 3:
        exit("Usage: " + sys.argv[0] + " [cuda|cpu] N()")
    
    
    @vectorize(["float32(float32, float32)"], target=sys.argv[1])
    def VectorAdd(a, b):  # note: despite the name, this is an element-wise multiply
        return a * b
    
    def main():
        N = int(sys.argv[2])
        A = np.zeros((N, N), dtype='f')
        B = np.zeros((N, N), dtype='f')
        A[:] = np.random.randn(*A.shape)
        B[:] = np.random.randn(*B.shape)
    
        start = timer()
        C = VectorAdd(A, B)
        elapsed_time = timer() - start
        print("Time: {}".format(elapsed_time))
    
    main()
    

    Results:

    $ python complex.py cpu 3000
    Time: 0.010573603001830634
    $ python complex.py cuda 3000
    Time: 0.3956961739968392
    $ python complex.py cpu 30
    Time: 9.693001629784703e-06
    $ python complex.py cuda 30
    Time: 0.10848476299725007
    

    Any idea why?