Matrix multiplication on CPU (numpy) and GPU (gnumpy) give different results

Solution 1

I would recommend using np.allclose for testing whether two float arrays are nearly equal.

Whereas you are only looking at the absolute difference between the values in your two result arrays, np.allclose also considers their relative differences. Suppose, for example, that the values in your input arrays were 1000x greater: the absolute differences between the two results would also be roughly 1000x greater, but that wouldn't mean the two dot products were any less precise.

np.allclose will return True only if the following condition is met for every corresponding pair of elements in your two test arrays, a and b:

abs(a - b) <= (atol + rtol * abs(b))

By default, rtol=1e-5 and atol=1e-8. These tolerances are a good 'rule of thumb', but whether they are small enough in your case will depend on your particular application. For example, if you're dealing with values < 1e-8, then an absolute difference of 1e-8 would be a total disaster!

If you try calling np.allclose on your two results with the default tolerances, you'll find that np.allclose returns True. My guess, then, is that these differences are probably small enough that they're not worth worrying about. It really depends on what you're doing with the results.
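For example, here is a minimal, self-contained sketch of that check (the second result is simulated here by adding an offset of about 1e-5 to the reference, roughly matching the differences in your output):

    import numpy as np

    a = np.random.uniform(size=(400, 400)).astype(np.float32)
    b = np.random.uniform(size=(400, 400)).astype(np.float32)

    x = a.dot(b)                # reference result; entries are around 100 here
    y = x + np.float32(1.5e-5)  # simulated second result, off by ~1e-5

    # np.allclose tests abs(x - y) <= atol + rtol * abs(y) element-wise
    print(np.allclose(x, y))                # True with the defaults (rtol=1e-5, atol=1e-8)
    print(np.allclose(1000 * x, 1000 * y))  # still True: the relative term scales with the values
    print(np.all(np.abs(x - y) <= 1e-8))    # a purely absolute check at 1e-8 fails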

Solution 2

RTX cards do floating-point arithmetic at half precision by default because it's faster for image rendering. You must tell the GPU to use full precision when multiplying floating-point values for AI, where precision is extremely important.

I experienced the same floating-point difference you did when trying to use CUDA with an RTX 2080 Ti.
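For what it's worth, here is a hedged sketch of what "telling the GPU to use full precision" can look like in practice. The question uses gnumpy rather than a deep-learning framework, so this is only an analogous example: PyTorch exposes backend flags that disable its reduced-precision (TF32) matrix multiplies on GPUs that support them.

    import torch

    # Sketch only: force matrix multiplications to run in full float32 instead of
    # the faster reduced-precision (TF32) path on GPUs that offer it.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

    a = torch.rand(400, 400, device="cuda")
    b = torch.rand(400, 400, device="cuda")
    c = a @ b  # full float32 precision with the flags above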

Comments

  • Ottokar over 1 year

    I'm using gnumpy to speed up some computations in training a neural network by doing them on GPU. I'm getting the desired speedup but am a little bit worried about the differences in the results of numpy (cpu) vs gnumpy (gpu).

    I have the following test script to illustrate the problem:

    import gnumpy as gpu
    import numpy as np
    
    n = 400
    
    a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
    b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
    
    ga = gpu.garray(a)
    gb = gpu.garray(b)
    
    ga = ga.dot(gb)
    a  = a.dot(b)
    
    print ga.as_numpy_array(dtype=np.float32) - a
    

    which provides the output:

    [[  1.52587891e-05  -2.28881836e-05   2.28881836e-05 ...,  -1.52587891e-05
        3.81469727e-05   1.52587891e-05]
     [ -5.34057617e-05  -1.52587891e-05   0.00000000e+00 ...,   1.52587891e-05
        0.00000000e+00   1.52587891e-05]
     [ -1.52587891e-05  -2.28881836e-05   5.34057617e-05 ...,   2.28881836e-05
        0.00000000e+00  -7.62939453e-06]
     ..., 
     [  0.00000000e+00   1.52587891e-05   3.81469727e-05 ...,   3.05175781e-05
        0.00000000e+00  -2.28881836e-05]
     [  7.62939453e-06  -7.62939453e-06  -2.28881836e-05 ...,   1.52587891e-05
        7.62939453e-06   1.52587891e-05]
     [  1.52587891e-05   7.62939453e-06   2.28881836e-05 ...,  -1.52587891e-05
        7.62939453e-06   3.05175781e-05]]
    

    As you can see, the differences are on the order of 10^-5.

    So the question is: should I be worried about these differences or is this the expected behaviour?

    Additional information:

    • GPU: GeForce GTX 770;
    • numpy version: 1.6.1

    I noticed the problem when I used gradient checking (with finite difference approximation) to verify that the small modifications I made while switching from numpy to gnumpy didn't break anything. As one might expect, gradient checking did not work with 32-bit precision (gnumpy does not support float64), but to my surprise the errors differed between the CPU and the GPU when using the same precision.

    The errors on CPU and GPU on a small test neural network are given below:

    [Image: gradient checking errors]

    Since the error magnitudes are similar, I guess that these differences are OK?
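    For reference, this is roughly the kind of finite-difference gradient check I mean; a toy quadratic loss stands in for the actual network, and the names are mine:

    import numpy as np

    def loss(w):
        # toy quadratic loss, stand-in for the network's objective
        return 0.5 * np.sum(w ** 2)

    def grad(w):
        # analytic gradient of the toy loss
        return w

    w = np.random.randn(5).astype(np.float32)
    eps = np.float32(1e-4)

    num_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        # central difference; float32 round-off limits how small this error can get
        num_grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

    # the gap between numerical and analytic gradients is noticeably larger in
    # float32 than when the same check is run in float64
    print(np.max(np.abs(num_grad - grad(w))))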

    After reading the article referenced in BenC's comment, I'm quite sure that the differences can mostly be explained by one of the devices using the fused multiply-add (FMA) instruction and the other not.

    I implemented the example from the paper:

    import gnumpy as gpu
    import numpy as np
    
    a=np.array([1.907607,-.7862027, 1.147311, .9604002], dtype=np.float32)
    b=np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)
    
    ga = gpu.garray(a)
    gb = gpu.garray(b)
    
    ga = ga.dot(gb)
    a  = a.dot(b)
    
    print "CPU", a
    print "GPU", ga
    print "DIFF", ga - a
    
    >>>CPU 0.0559577
    >>>GPU 0.0559577569366
    >>>DIFF 8.19563865662e-08
    

    ...and the difference is similar to that between the FMA and serial algorithms in the paper (though for some reason both results differ from the exact result by more than they do in the paper).

    The GPU I'm using (GeForce GTX 770) supports the FMA instruction, while my CPU does not (it is an Ivy Bridge Intel Xeon E3-1225 V2; Intel only introduced the FMA3 instruction with Haswell).

    Other possible explanations include different math libraries being used behind the scenes, or differences in the sequence of operations caused by, for example, the different levels of parallelization on the CPU vs the GPU.
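    To get a feel for how much the rounding of intermediate operations matters here, the following stand-in experiment (it does not emulate FMA itself, only the effect of keeping extra precision in the intermediate products and sums) compares a fully float32 serial accumulation with one that accumulates in float64:

    import numpy as np

    a = np.array([1.907607, -.7862027, 1.147311, .9604002], dtype=np.float32)
    b = np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)

    # serial accumulation with every product and partial sum rounded to float32
    acc32 = np.float32(0.0)
    for x, y in zip(a, b):
        acc32 = np.float32(acc32 + x * y)

    # the same accumulation with float64 intermediates, a rough stand-in for the
    # extra internal precision an FMA-based dot product keeps before the final rounding
    acc64 = 0.0
    for x, y in zip(a, b):
        acc64 += float(x) * float(y)

    print("float32 serial:        %r" % acc32)
    print("float64 intermediates: %r" % np.float32(acc64))
    print("difference:            %r" % (np.float32(acc64) - acc32))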