Python: rewrite a looping numpy math function to run on GPU

12,558

Solution 1

Introduction and solution code

Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule functionality that lets us write and compile CUDA kernels staying in Python environment.

Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.

Listing out the PyCUDA-NumPy solution code -

1] PyCUDA part -

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK

__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
    int tid = threadIdx.x;
    int inv = TBP;
    __shared__ float dS[TBP][2];

    dS[tid][0] = d1[tid+offset];  
    dS[tid][1] = d2[tid+offset];         
    __syncthreads();

    if(tid<L-TBP)  
    {
        dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
        dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)   
    {
        if(tid<inv)
        {
            dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
            dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
        }
        __syncthreads();
        inv = inv/2;
    }
    __syncthreads();

    if(tid==0)
    {
        out[0] = dS[0][0];
        out[1] = dS[0][1];
    }   
    __syncthreads();
}

__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
    int L = BLOCKLEN[0];
    int tid = threadIdx.x;
    int iterID = blockIdx.x;
    float Bmax_Cmin[2];
    int inv;
    float Cmin, dif;   
    __shared__ float dS[TBP*2];   

    get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);  
    Cmin = Bmax_Cmin[1];
    dif = (Bmax_Cmin[0] - Cmin);

    inv = TBP;

    dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
    __syncthreads();

    if(tid<L-TBP)  
        dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);                   

     dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
     __syncthreads();

     if(tid<L-TBP)
         dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
     __syncthreads();

    inv = inv/2;
    while(inv!=0)   
    {
        if(tid<inv)
            dS[tid] += dS[tid+inv];
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
        out[iterID] = dS[0];
    __syncthreads();

}
""")

Please note that THREADS_PER_BLOCK, TBP is to be set based on the batchSize. The rule of thumb here is to assign power of 2 value to TBP that is just lesser than batchSize. Thus, for batchSize = 2000, we needed TBP as 1024.

2] NumPy part -

def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
    func1 = mod.get_function("main1")
    outlen = len(A)-batchSize+1

    # Set block and grid sizes
    BSZ = (1024,1,1)
    GSZ = (outlen,1)

    dest = np.zeros(outlen).astype(np.float32)
    N = np.int32(batchSize)
    func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                     drv.In(data2b), drv.In(data2a),\
                     drv.In(N), block=BSZ, grid=GSZ)
    idx = np.flatnonzero(dest >= minimumLimit)
    return idx, dest[idx]

Benchmarking

I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.

Original approach -

def org_app(data1, batchSize, minimumLimit):
    resultArray = []
    for rowNr in  range(data1.shape[0]-batchSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result]) 
    return resultArray

Timings and verification -

In [2]: #Declare variables
   ...: batchSize = 2000
   ...: sampleSize = 50000
   ...: resultArray = []
   ...: minimumLimit = 490 #use 400 on the real sample data
   ...: 
   ...: #Create Random Sample Data
   ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
   ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: 
   ...: # Make column copies
   ...: A = data1[:,0].copy()
   ...: B = data1[:,1].copy()
   ...: C = data1[:,2].copy()
   ...: D = data1[:,3].copy()
   ...: 
   ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
   ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
   ...: print(np.allclose(gpu_out1, cpu_out1))
   ...: print(np.allclose(gpu_out2, cpu_out2))
   ...: 
True
False

So, there's some differences between CPU and GPU countings. Let's investigate them -

In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))

In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])

In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1.,  1.,  1.])

There are four instances of non-matching counts. These are off at max by 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.

Finally, puting the issue aside, let's get to the most important bit, the performance -

In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop

In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop

In [12]: 2180.0/82.5
Out[12]: 26.424242424242426

Let's try with bigger datasets. With sampleSize = 500000, we get -

In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop

In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop

In [16]: 23200.0/821
Out[16]: 28.25822168087698

So, the speedup stays constant at around 27.

Limitations :

1) We are using float32 numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.

Further improvement :

1) We could use faster constant memory to feed in data2a and data2b, rather than use global memory.

Solution 2

Tweak #1

Its usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible to optimize on the last step of summing.

We could replace the step that makes an array of 1s and 0s and does summing :

np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

with np.count_nonzero that works efficiently to count True values in a boolean array, instead of converting to 1s and 0s -

np.count_nonzero((abcd <= data2a) & (abcd >= data2b))

Runtime test -

In [45]: abcd = np.random.randint(11,99,(10000))

In [46]: data2a = np.random.randint(11,99,(10000))

In [47]: data2b = np.random.randint(11,99,(10000))

In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
10000 loops, best of 3: 81.8 µs per loop

In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
10000 loops, best of 3: 28.8 µs per loop

Tweak #2

Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. Some more info here. Thus, store reciprocal of dif and use that instead at the step :

((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ...

Sample test -

In [52]: A = np.random.rand(10000)

In [53]: dif = 0.5

In [54]: %timeit A/dif
10000 loops, best of 3: 25.8 µs per loop

In [55]: %timeit A*(1.0/dif)
100000 loops, best of 3: 7.94 µs per loop

You have four places using division by dif. So, hopefully this would bring out noticeable boost there too!

Solution 3

Before you start tweaking the target (GPU) or using anything else (i.e. parallel executions ), you might want to consider how to improve the already existing code. You used the -tag so I'll use it to improve the code: First we operate on arrays not on matrices:

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

Each time you call doTheMath you expect an integer back, however you use a lot of arrays and create a lot of intermediate arrays:

abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

This creates an intermediate array each step:

  • tmp1 = A-Cmin,
  • tmp2 = tmp1 / dif,
  • tmp3 = B - Cmin,
  • tmp4 = tmp3 / dif
  • ... you get the gist.

However this is a reduce function (array -> integer) so having a lot of intermediate arrays is unnecessary weight, just calculate the value of the "fly".

import numba as nb

@nb.njit
def doTheMathNumba(tmpData, data2a, data2b):
    Bmax = np.max(tmpData[:, 1])
    Cmin = np.min(tmpData[:, 2])
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    sum_ = 0
    for i in range(tmpData.shape[0]):
        val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

I did something else here to avoid multiple operations:

(((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
= ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
= (A + B + C + D - 4 * Cmin) / (4 * dif)
= (A + B + C + D) / (4 * dif) - (Cmin / dif)

This actually cuts down the execution time by almost a factor of 10 on my computer:

%timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
%timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop

There are certainly also other improvements, for example using a rolling min/max to calculate Bmax and Cmin, that would make at least part of the calculation run in O(sampleSize) instead of O(samplesize * batchsize). This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated to do because the comparisons differ. But definitely possible! See here:

import time
import numpy as np
import numba as nb

@nb.njit
def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    quarter_idiff = 0.25 * idiff
    sum_ = 0
    for i in range(abcd.shape[0]):
        val = abcd[i] * quarter_idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

@nb.njit
def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
    found = 0
    for rowNr in range(data1.shape[0]):
        if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
            result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                    data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
            if (result >= minimumLimit):
                resultArray[found, 0] = rowNr
                resultArray[found, 1] = result
                found += 1
    return resultArray[:found]

#Declare variables
batchSize = 2000
sampleSize = 50000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

from scipy import ndimage
t0 = time.time()
abcd = np.sum(data1, axis=1)
Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))  # correction for even shapes
Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))

result = np.zeros((sampleSize, 2), dtype=np.int64)
doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
print('Runtime:', time.time() - t0)

This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original took had Runtime: 24.68975639343262. Now we're 30 times faster!

With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?

And even if I haven't done this myself, says that it's possible to write "Numba for CUDA GPUs" and it doesn't seem to complicated.

Solution 4

Here is some code to demonstrate what is possible by just tweaking the algorithm. It's pure numpy but on the sample data you posted gives a roughly 35x speedup over the original version (~1,000,000 samples in ~2.5sec on my rather modest machine):

>>> result_dict = master('run')
[('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
TOTAL 2.44821691513

Tweaks used:

  • A+B+C+D, see my other answer
  • running min/max, including avoiding to compute (A+B+C+D - 4Cmin)/(4dif) multiple times with the same Cmin/dif.

These are more or less routine. That leaves the comparison with data2a/b which is expensive O(NK) where N is the number of samples and K is the size of the window. Here one can take advantage of the relatively well-behaved data. Using the running min/max one can create variants of data2a/b that can be used to test a range of window offsets at a time, if the test fails all these offsets can be ruled out immediately, otherwise the range is bisected.

import numpy as np
import time

# global variables; they will hold the precomputed pre-screening filters
preA, preB = {}, {}
CHUNK_SIZES = None

def sliding_argmax(data, K=2000):
    """compute the argmax of data over a sliding window of width K

    returns:
        indices  -- indices into data
        switches -- window offsets at which the maximum changes
                    (strictly speaking: where the index of the maximum changes)
                    excludes 0 but includes maximum offset (len(data)-K+1)

    see last line of compute_pre_screening_filter for a recipe to convert
    this representation to the vector of maxima
    """
    N = len(data)
    last = np.argmax(data[:K])
    indices = [last]
    while indices[-1] <= N - 1:
        ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
        if len(ge) == 0:
            if last + K >= N:
                break
            last += 1 + np.argmax(data[last + 1 : last + K + 1])
            indices.append(last)
        else:
            last += 1 + ge[0]
            indices.append(last)
    indices = np.array(indices)
    switches = np.where(data[indices[1:]] > data[indices[:-1]],
                        indices[1:] + (1-K), indices[:-1] + 1)
    return indices, np.r_[switches, [len(data)-K+1]]


def compute_pre_screening_filter(bound, n_offs):
    """compute pre-screening filter for point-wise upper bound

    given a K-vector of upper bounds B and K+n_offs-1-vector data
    compute K+n_offs-1-vector filter such that for each index j
    if for any offset 0 <= o < n_offs and index 0 <= i < K such that
    o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j

    therefore the number of data points below filter is an upper bound for
    the maximum number of points below bound in any K-window in data
    """
    pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
    padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
    indices, switches = sliding_argmax(padded, n_offs)
    return padded[indices].repeat(np.diff(np.r_[[0], switches]))


def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
    """compute upper and lower pre-screening filters for data blocks of
    sizes K+n_offs-1 where
    n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk

    the result is stored in global variables preA and preB
    """
    global CHUNK_SIZES

    CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
    preA[1] = upper
    preB[1] = lower
    for n in CHUNK_SIZES:
        preA[n] = compute_pre_screening_filter(upper, n)
        preB[n] = -compute_pre_screening_filter(-lower, n)


def test_bounds(block, counts, threshold=400):
    """test whether the windows fitting in the data block 'block' fall
    within the bounds using pre-screening for efficient bulk rejection

    array 'counts' will be overwritten with the counts of compliant samples
    note that accurate counts will only be returned for above threshold
    windows, because the analysis of bulk rejected windows is short-circuited

    also note that bulk rejection only works for 'well behaved' data and
    for example not on random numbers
    """
    N = len(counts)
    K = len(preA[1])
    r = N % CHUNK_SIZES[0]
    # chop up N into as large as possible chunks with matching pre computed
    # filters
    # start with small and work upwards
    counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                   (block[l:l+K] >= preB[1]))
                  for l in range(r)]

    def bisect(block, counts):
        M = len(counts)
        cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
        if cnts < threshold:
            counts[:] = cnts
            return
        elif M == CHUNK_SIZES[0]:
            counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                          (block[l:l+K] >= preB[1]))
                         for l in range(M)]
        else:
            M //= 2
            bisect(block[:-M], counts[:M])
            bisect(block[M:], counts[M:])

    N = N // CHUNK_SIZES[0]
    for M in CHUNK_SIZES:
        if N % 2:
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M
        elif N == 0:
            return
        N //= 2
    else:
        for j in range(2*N):
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M


def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
            threshold=400):
    samples, upper, lower = data
    N, K = samples.shape[0], upper.shape[0]
    times = [time.time()]
    if use_pre_screening:
        compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
    times.append(time.time())
    # compute switching points of max and min for running normalisation
    upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
    lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
    times.append(time.time())
    # sum columns
    ABCD = samples.sum(axis=-1)
    times.append(time.time())
    counts = np.empty((N-K+1,), dtype=int)
    # main loop
    # loop variables:
    offs = 0
    u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
    l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
    while True:
        # check which is switching next, min(C) or max(B)
        if u_swp > l_swp:
            # greedily take the largest block possible such that dif and Cmin
            # do not change
            block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:l_swp], threshold=threshold)
            else:
                counts[offs:l_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(l_swp - offs)]
            # book keeping
            l_ind += 1
            offs = l_swp
            l_swp = lower_swp[l_ind]
            l_scale = samples[lower_inds[l_ind], 2]
        else:
            block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:u_swp], threshold=threshold)
            else:
                counts[offs:u_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(u_swp - offs)]
            u_ind += 1
            if u_ind == len(upper_inds):
                assert u_swp == N-K+1
                break
            offs = u_swp
            u_swp = upper_swp[u_ind]
            u_scale = samples[upper_inds[u_ind], 1]
    times.append(time.time())
    return {'counts': counts, 'valid': np.where(counts >= 400)[0],
            'timings': np.diff(times)}


def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
           min_chnk=None, dyads=None):
    t = time.time()
    if data in ('fake', 'load'):
        data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
                           usecols=[1,2,3,4])
        data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        if data == 'fake':
            data1 = np.tile(data1, (10, 1))
        threshold = 400
    elif data == 'random':
        data1 = np.random.random((102000, 4))
        data2b = np.random.random(2000)
        data2a = np.random.random(2000)
        threshold = 490
        if use_pre_screening or mode == 'calibrate':
            print('WARNING: pre-screening not efficient on artificial data')
    else:
        raise ValueError("data mode {} not recognised".format(data))
    data = data1, data2a, data2b
    t_load = time.time() - t
    if mode == 'calibrate':
        min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
        dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
        timings = np.zeros((len(min_chnk), len(dyads)))
        print('max bisect  ' + ' '.join([
            '   n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
              end='')
        for i, mc in enumerate(min_chnk):
            print('\nmin chunk {}'.format(mc), end=' ')
            for j, dy in enumerate(dyads):
                for k in range(nrep):
                    if dy == 0: # no pre-screening
                        timings[i, j] += analyse(
                            data, False, mc, dy, threshold)['timings'][3]
                    else:
                        timings[i, j] += analyse(
                            data, True, mc, dy, threshold)['timings'][3]
                timings[i, j] /= nrep
                print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
        best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
                                            timings.shape)
        print('\nbest', min_chnk[best_mc], dyads[best_dy])
        return timings, min_chnk[best_mc], dyads[best_dy]
    if mode == 'run':
        min_chnk = 2 if min_chnk is None else min_chnk
        dyads = 5 if dyads is None else dyads
        res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
        times = np.r_[[t_load], res['timings']]
        print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
        print('TOTAL', times.sum())
        return res

Solution 5

This is technically off-topic (not GPU) but I'm sure you'll be interested.

There is one obvious and rather large saving:

Precompute A + B + C + D (not in the loop, on the whole data: data1.sum(axis=-1)), because abcd = ((A+B+C+D) - 4Cmin) / (4dif). This will save quite a few ops.

Surprised nobody spotted that one before ;-)

Edit:

There is another thing, though I suspect that's only in your example, not in your real data:

As it stands roughly half of data2a will be smaller than data2b. In these places your conditions on abcd cannot be both True, so you needn't even compute abcd there.

Edit:

One more tweak I used below but forgot to mention: If you compute the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (happens roughly once in windowlength times, and even if it happens, you immediately know the new max) or the old max falls off the window on the left (also happens roughly once in windowlength times). Only in this last case you have to search the entire window for the new max.

Edit:

Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you yourself have to test it for speed. Put "gpu" for "cpu" on the marked line.

On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1 If you do a pip3 install tensorflow-gpu today it should might work.

import numpy as np
import time
import tensorflow as tf

# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations

# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32

def worker(data1, data2a, data2b):
    ###################################
    # CHANGE cpu to gpu in next line! #
    ###################################
    with tf.device('/cpu:0'):
        g = tf.Graph ()
        with g.as_default():
            ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
            B = tf.constant(data1[:, 1], dtype=ddf)
            C = tf.constant(data1[:, 2], dtype=ddf)
            window = tf.constant(len(data2a))
            N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
            data2a = tf.constant(data2a, dtype=ddf)
            data2b = tf.constant(data2b, dtype=ddf)
            def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
                # most of the time we can keep the old max/min
                Bmaxind = tf.cond(Bmaxind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmax(B[i:i+window], axis=0)),
                                  lambda: tf.cond(Bmax>B[i+window-1], 
                                                  lambda: Bmaxind, 
                                                  lambda: i+window-1))
                Cminind = tf.cond(Cminind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmin(C[i:i+window], axis=0)),
                                  lambda: tf.cond(Cmin<C[i+window-1],
                                                  lambda: Cminind,
                                                  lambda: i+window-1))
                Bmax = B[Bmaxind]
                Cmin = C[Cminind]
                abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
                out = out.write(i, tf.to_int32(
                    tf.count_nonzero(tf.logical_and(abcd <= data2a,
                                                    abcd >= data2b))))
                return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
            with tf.Session(graph=g) as sess:
                i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
                    lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
                    (tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
                     tf.Variable(-1, dtype=ddi),
                     tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
                     tf.TensorArray(ddi, size=N)),
                    shape_invariants=None,
                    parallel_iterations=parallel_iterations,
                    back_prop=False)
                out = out.pack()
                sess.run(tf.initialize_all_variables())
                out, = sess.run((out,))
    return out

#Declare variables
batchSize = 2000
sampleSize = 50000#00
resultArray = []

#Create Sample Data
data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
data2a = np.random.uniform(0, 1, (batchSize,))
data2b = np.random.uniform(0, 1, (batchSize,))

t0 = time.time()
out = worker(data1, data2a, data2b)
print('Runtime (tensorflow):', time.time() - t0)


good_indices, = np.where(out >= 490)
res_tf = np.c_[good_indices, out[good_indices]]

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B  = tmpData1[:,1]
    C   = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Loop through the data
t0 = time.time()
for rowNr in  range(sampleSize+1):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    result = doTheMath(tmp_df, data2a, data2b)
    if (result >= 490):
        resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))
Share:
12,558

Related videos on Youtube

RaduS
Author by

RaduS

Updated on September 14, 2022

Comments

  • RaduS
    RaduS over 1 year

    Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I used a few good days now trying to get my head around it but to no result. I wonder maybe somebody can help me rewrite this function in whatever way you may seem fit as log as I gives the same result at the end. I tried to use @jit from numba but for some reason it is actually much slower than running the code as usual. With a huge sample size, the goal is to decrease the execution time considerably so naturally I believe the GPU is the fastest way to do it.

    I'll explain a little what is actually happening. The real data, which looks almost identical as the sample data created in the code below is divided into sample sizes of approx 5.000.000 rows each sample or around 150MB per file. In total there are around 600.000.000 rows or 20GB of data. I must loop through this data, sample by sample and then row by row in each sample, take the last 2000 (or another) rows as of each line and run the doTheMath function which returns a result. That result is then saved back to the hardrive where I can do some other things with it with another program. As you can see below, I do not need all of the results of all the rows, only those bigger than a specific amount. If I run my function as it is right now in python I get about 62seconds per 1.000.000 rows. This is a very long time considering all the data and how fast it should be done with.

    I must mention that I upload the real data file by file to the RAM with the help of data = joblib.load(file) so uploading the data is not the problem as it takes only about 0.29 seconds per file. Once uploaded I run the entire code below. What takes the longest time is the doTheMath function. I am willing to give all of my 500 reputation points I have on stackoverflow as a reward for somebody willing to help me rewrite this simple code to run on the GPU. My interest is specifically in the GPU, I really want to see how it is done on this problem at hand.

    EDIT/UPDATE 1: Here is a link to a small sample of the real data: data_csv.zip About 102000 rows of real data1 and 2000 rows for real data2a and data2b. Use minimumLimit = 400 on the real sample data

    EDIT/UPDATE 2: For those following this post here is a short summary of the answers below. Up until now we have 4 answers to the original solution. The one offered by @Divakar are just tweaks to the original code. Of the two tweaks only the first one is actually applicable to this problem, the second one is a good tweak but does not apply here. Out of the other three answers, two of them are CPU based solutions and one tensorflow-GPU try. The Tensorflow-GPU by Paul Panzer seems to be promising but when i actually run it on the GPU it is slower than the original, so the code still needs improvement.

    The other two CPU based solutions are submitted by @PaulPanzer (a pure numpy solution) and @MSeifert (a numba solution). Both solutions give very good results and both process data extremely fast compared to the original code. Of the two the one submitted by Paul Panzer is faster. It processes about 1.000.000 rows in about 3 seconds. The only problem is with smaller batchSizes, this can be overcome by either switching to the numba solution offered by MSeifert, or even the original code after all the tweaks that have been discussed below.

    I am very happy and thankful to @PaulPanzer and @MSeifert for the work they did on their answers. Still, since this is a question about a GPU based solution, i am waiting to see if anybody is willing to give it a try on a GPU version and see how much faster the data can be processed on the GPU when compared to the current CPU solutions. If there will be no other answers outperforming @PaulPanzer's pure numpy solution then i'll accept his answer as the right one and gets the bounty :)

    EDIT/UPDATE 3: @Divakar has posted a new answer with a solution for the GPU. After my testings on real data, the speed is not even comparable to the CPU counterpart solutions. The GPU processes about 5.000.000 in about 1,5 seconds. This is incredible :) I am very excited about the GPU solution and i thank @Divakar for posting it. As well as i thank @PaulPanzer and @MSeifert for their CPU solutions :) Now my research continues with an incredible speed due to the GPU :)

    import pandas as pd
    import numpy as np
    import time
    
    def doTheMath(tmpData1, data2a, data2b):
        A = tmpData1[:, 0]
        B = tmpData1[:,1]
        C = tmpData1[:,2]
        D = tmpData1[:,3]
        Bmax = B.max()
        Cmin  = C.min()
        dif = (Bmax - Cmin)
        abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
        return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    
    #Declare variables
    batchSize = 2000
    sampleSize = 5000000
    resultArray = []
    minimumLimit = 490 #use 400 on the real sample data 
    
    #Create Random Sample Data
    data1 = np.matrix(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #upper limit
    data2b = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #lower limit
    #approx. half of data2a will be smaller than data2b, but that is only in the sample data because it is randomly generated, NOT the real data. The real data2a is always higher than data2b.
    
    
    #Loop through the data
    t0 = time.time()
    for rowNr in  range(data1.shape[0]):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        if(tmp_df.shape[0] == batchSize):
            result = doTheMath(tmp_df, data2a, data2b)
            if (result >= minimumLimit):
                resultArray.append([rowNr , result])
    print('Runtime:', time.time() - t0)
    
    #Save data results
    resultArray = np.array(resultArray)
    print(resultArray[:,1].sum())
    resultArray = pd.DataFrame({'index':resultArray[:,0], 'result':resultArray[:,1]})
    resultArray.to_csv("Result Array.csv", sep=';')
    

    The PC specs I am working on:

    GTX970(4gb) video card; 
    i7-4790K CPU 4.00Ghz; 
    16GB RAM;
    a SSD drive 
    running Windows 7; 
    

    As a side question, would a second video card in SLI help on this problem?

    • talonmies
      talonmies about 7 years
      SLI is irrelevant and has nothing to do with CUDA. As for how can you convert that code -- you do it by sitting down in front of your computer and typing new CUDA kernel code into your computer. And if you want to run it on two GPUs, you also type in API code to manage running the code on two GPUs.
    • sascha
      sascha about 7 years
      You can always try numba which can try to automatically use CUDA to some extent. A better approach would be using Theano/Tensorflow's computation-graphs and implement you algorithm within their frameworks to compile it for GPUs. But yeah, in general it's about knowing CUDA and customly design your algorithm for it using the available tools like talonmies mentioned.
    • RaduS
      RaduS about 7 years
      thank you for the suggestion @sascha. i thought that Theano and Tensorflow is only for machine learning problems. I will see into numba for the moment
    • sascha
      sascha about 7 years
      @RaduS No, they are general-purpose tools for math-calculations.
    • Divakar
      Divakar about 7 years
      One biggest improvement I think would be by using an initialized output array : resultArray and then at each iteration indexing into it to update, rather than starting off with empty list and using slow append.
    • RaduS
      RaduS about 7 years
      Thank you Divakar for the tweak suggestion
    • talonmies
      talonmies about 7 years
      There isn't any meaningful CUDA programming question I can see here. Why did you re-add the CUDA tag to this question?
    • RaduS
      RaduS about 7 years
      removed it, sorry :)
  • RaduS
    RaduS about 7 years
    hi @Divakar, regarding tweak#2, i read the post you linked at and tried to implement it. but it seams that i do not get the same result. maybe i am doing something wrong. Can you have a look at it? maybe it's easier for you to spot the mistake dif = 1.0 /(Bmax - Cmin) and then abcd = ((dif * A) + ((dif * B) + (dif*C) + (dif*D)) / 4)
  • Divakar
    Divakar about 7 years
    @RaduS Well, if Bmax and Cmin are close then, Bmax - Cmin would be a small number and its reciprocal would be a big number. So, later on when arrays are multiplied by that number, we would have different numbers. So, we maybe skip that tweak.
  • RaduS
    RaduS about 7 years
    thank you for the answer Paul. I tested the code on two separate computers both with Windows installed, Python3.5 and tf 0.12.1. For some reason the tensorflow version is slower than the original, even if i activate the GPU, it is still slower than the original. Here are some statistics: Pc1 has no GPU installed: Runtime (tensorflow): 9.321721315383911 Runtime (original): 3.7472479343414307 True Pc2 with GPU installed and activated: Runtime (tensorflow): 72.36511659622192 Runtime (original): 3.5680108070373535 True
  • RaduS
    RaduS about 7 years
    I also get a warning 'WARNING:tensorflow:From C:/testings.py:61 in worker.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Use tf.global_variables_initializer instead.'
  • RaduS
    RaduS about 7 years
    This was just a test on the code you sent with no changes to the data or the sample size. Could it be slower because it is Windows? or because i have python 3.5 and not 3.6? Or is there another reason?
  • Paul Panzer
    Paul Panzer about 7 years
    @RaduS I'm afraid when it comes to tensorflow I'm dabbling myself. Profiling and debugging are a nightmare, as far as I can tell. Let's wait a few days. Perhaps some tf buff will pick up the threads. If not I can have another look. You could try this recipy to get an idea what makes it so slow. Sorry I can't be of more help at this point.
  • RaduS
    RaduS about 7 years
    Thank you @PaulPanzer for giving it a try. Just as a side note, I uploaded a sample data in the question edit, if you want to test on it
  • Paul Panzer
    Paul Panzer about 7 years
    @RaduS Couldn't read them. Could you upload some textfiles (csv's)? Shouldn't be too large if you zip them. Is there any structure in the data, for example are they smooth?
  • RaduS
    RaduS about 7 years
    On the real sample data use the minimum limit of 400 instead for 490 @PaulPanzer
  • RaduS
    RaduS about 7 years
    I really like your edit 1 about abcd = ((A+B+C+D) - 4Cmin) / (4dif) and yes it is true that half of data2a will be smaller than data2b, but that is only in the sample data, not the real data. In the real data data2a is always higher than data2b. These are the upper and lower limits
  • Paul Panzer
    Paul Panzer about 7 years
    "This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated ..." Done just that, will post in a few minutes. It's faast, and I'm using pure numpy.
  • MSeifert
    MSeifert about 7 years
    ok, I have to correct my previous statement because I did something wrong, it's only 30 times faster :(
  • MSeifert
    MSeifert about 7 years
    @PaulPanzer Yes, one can implement all of these functions again (instead of using scipy filters) but I think you're numpy code is quite convolved and on my computer also slower (not much, but almost 2x slower). So I don't think it's an advantage to "use pure numpy" here. Besides: Numba can actually compile code for GPUs even though I haven't done it myself. :)
  • RaduS
    RaduS about 7 years
    wow that is really impressive results, i like your approach. I see that the res_indices returns a list of all the indexes where it's above the threshold. is How do i get also the result number for each index in the same array after the where?
  • Paul Panzer
    Paul Panzer about 7 years
    Did you use actual data or just random numbers? Makes quite a difference here (x2 - x3). Btw. I'm not religious about numpy or numba or whatever, I was just pointing out that the 30x I'm seeing are mostly down to the improved algo, if you can get more on top of that using numba or whatever, all the better. Oh, and does scipy actually have a sliding argmax or even a sliding max?
  • Paul Panzer
    Paul Panzer about 7 years
    You can use res_indices directly on out out[res_indices] gives you the numbers of points that satisfied your criteria at each offset where this number was 400 or more. Could you test the script on some more of your data? I tweaked it to the sample you posted but I'd be interested to know whether it also works well on other samples.
  • RaduS
    RaduS about 7 years
    I tested now quite a lot your script on the real data and the speed is amazing as well as the accuracy is 100% correct. I get about 3 seconds per 1mil rows. This is really impressive considering that fact that it is running just on the CPU. I am really happy about the results of the script although it's a bit hard for me to understand all of it :)
  • RaduS
    RaduS about 7 years
    What means the ALL_IN = True and MAX_BLUR = 160 variables?
  • RaduS
    RaduS about 7 years
    Hi @MSeifert Thank you for submitting your answer. Now i tested your method on quite a substantial amount of real data. The results are indeed faster, not as fast as i expected from numba, but they are a faster than the original version i had. There is one problem. The accuracy of the results is totally 0%. There has to be something wrong with the calculations, hmm, i will see if i can find the problem tomorrow
  • RaduS
    RaduS about 7 years
    There is one thing i noticed, the more i decrease the window size substantially, the slower it gets which is maybe normal considering the method you are using here
  • MSeifert
    MSeifert about 7 years
    @RaduS I accidentally copied the attempt with mode and cval in the filters. I edited the answer and it should be working now. :)
  • MSeifert
    MSeifert about 7 years
    @PaulPanzer I don't think that scipy has a sliding max (or argmax) function. But I know pandas.rolling_max. But that's deprecated for numpy-arrays and also much slower than the tweaked scipy-filter.
  • Paul Panzer
    Paul Panzer about 7 years
    You mean the length of data2a/b? Is it really slower or only less faster? How small are the winows? Anyway, I've given the code a facelift. The function and variable names are chosen with more care and there are a few docstrings. ALL_IN and MAX_BLUR have been replaced with use_pre_screening, min_chnk and dyads, which are used to switch the bulk rejection code on or off and to control the chunk sizes of offsets to be pre screened. There is now a master function that supports a calibrate mode that can help you find the best parameters for example if you want to change the window size.
  • Paul Panzer
    Paul Panzer about 7 years
    @MSeifert Thanks. Any chance of your divulging what this mysterious tweaked scipy filter is?
  • Paul Panzer
    Paul Panzer about 7 years
    Thinking about it it kind of makes sense, because one of the major savings is exploiting the fact that the sliding max doesn't change too often. Now, the smaller you make the window, the less true this becomes, so while your savings go away you are still lumbered with the overhead of all that tricksy code. If you are going to very small windows other strategies might be perform better...
  • RaduS
    RaduS about 7 years
    by now i did some quite extensive testings with your script and i really like how it performs and especially i like that it gives 100% accuracy. Occasionally it throws an error "RuntimeWarning: divide by zero encountered in double_scalars * (0.25 / (u_scale-l_scale))" on line block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \ * (0.25 / (u_scale-l_scale)) and also another error RuntimeWarning: invalid value encountered in greater_equal for l in range(r)] But this has happen only a few times so i have to check the data to see maybe there something wrong there
  • RaduS
    RaduS about 7 years
    I will wait a bit more and if nobody else submits an answer for GPU that would perform better i will accept your answer and the points are yours :)
  • user7138814
    user7138814 about 7 years
    scipy has scipy.ndimage.maximum_filter1d
  • MSeifert
    MSeifert about 7 years
    @PaulPanzer It's just a sliding minimum (or maximum) that looks size-elements ahead.
  • MSeifert
    MSeifert about 7 years
    @user7138814 I already use maximum_filter1d and minimum_filter1d in the solution.
  • Paul Panzer
    Paul Panzer about 7 years
    Ok, the website is yelling at me not to be so talkative, so briefly: do these warnings occur together? The first you describe looks like Cmin and Bmax taking the same value. The second I'm not sure could be something wrong with data2b. What happens if you run the original script over the same data?
  • Paul Panzer
    Paul Panzer about 7 years
    I couldn't help tinkering a bit more. The new code fixes two small bugs and has a new sliding_argmax which on my rig at our standard 1,000,000 million sample test shaves off another half second. So we're down to 2.5sec of which 0.8sec are for loading the data!
  • user7138814
    user7138814 about 7 years
    @MSeifert Oh I didn't see it at the imports there ;-)
  • RaduS
    RaduS about 7 years
    finnally a GPU solution :) thank you @Divakar. I will test it extensively on real data later today and come back with the results
  • Divakar
    Divakar about 7 years
    @RaduS Be sure to check out the edited codes (just edited) for benchmarking! Now it accepts any arbitrary batchSize.
  • RaduS
    RaduS about 7 years
    Hi @Divakar, i did some intensive testings on a lot of real data with the script. And the speed is indeed impressive. But the results are wrong. The output is too big for any given data. The result array should only contain rows which are bigger than the minimumLimit. Here is a link link where i uploaded a sample 102000 rows of real data as well as your script running it. The original script you posted in your solution also gives the wrong result, i guess it is because you change data2a to fit the GPU function but it is not the same input for the CPU function
  • RaduS
    RaduS about 7 years
    In any case in the attached script you can also see the data which should be given to the original CPU script and the data which is given the GPU script. I do not totally understand why the GPU script returns such a big result back, it should't. Also attached in the file you will find what the output of the original CPU script should be based on the real sample data. I would appreciate a lot if you can have a look at the GPU script, maybe you can spot easily what is wrong
  • RaduS
    RaduS about 7 years
    what i meant earlier is that "The result array should only contain a few rows which are bigger than the minimumLimit not so many"
  • RaduS
    RaduS about 7 years
    I also notice that if the sampleSize of data1 is bigger than 4.500.000 rows it crashes the script and the video card. Could it be because of too much data going in? Or maybe because the out result (which is wrong) is too big as well?
  • RaduS
    RaduS about 7 years
    The error is File "C:\toolkits\anaconda3-4.2.0\lib\site-packages\pycuda\driver‌​.py", line 411, in function_call handler.post_call(stream) File "C:\toolkits\anaconda3-4.2.0\lib\site-packages\pycuda\driver‌​.py", line 96, in post_call memcpy_dtoh(self.array, self.get_device_alloc()) pycuda._driver.Error: cuMemcpyDtoH failed: unknown error
  • Divakar
    Divakar about 7 years
    @RaduS Added Clarification #1 : Issue on value mis-match section on clarifying that value mis match issue. Let me know your thoughts. On bigger data causing crash, I would think its the huge-ness of the data. So, to resolve, we might need to iterate through rather than overload data into GPU. So, we could send data to GPU in chunks. Given that we are using good capable GPUs, we could send in big chunks and thus iterate less and this won't be that bad I think.
  • RaduS
    RaduS about 7 years
    aha, thank you for clarification, i think i found the problem of too big output. Why do you add data2b to data2a? (data2a = data2b + np.random.....) Is it just a typo in your code or is there a reason for that? Because if i just use data2a and data2b without adding b to a, then i get the correct result on the GPU :) with the small max_error of 2 which i am ok with :)
  • Divakar
    Divakar about 7 years
    @RaduS Ah yes! That might be it! Well I was doing that adding part to make the upper limit array data2a always bigger than data2b because I was using randomly generated data. So, yes if you are using CSV data, remove that adding part. Edited that gist post accordingly.
  • RaduS
    RaduS about 7 years
    well in that case i accept your answer as the correct one :) THANK YOU. now i can continue my research faster since i have the script running on the GPU. The last question, you mention something about improvement on the "constant memory" Could you please help me better understand this and elaborate a bit your answer?
  • Divakar
    Divakar about 7 years
    @RaduS Sure, will do that, by tonight I think :)
  • RaduS
    RaduS about 7 years
    you are also the bounty winner :)
  • Divakar
    Divakar about 7 years
    @RaduS Thank you so much! :D
  • Divakar
    Divakar about 7 years
    @RaduS Removed that Clarification #1 : Issue on value mis-match section, as it seems the issue was simply the erroneous adding part :)
  • RaduS
    RaduS about 7 years
    for some reason i think of this video link and it makes me smile when comparing your GPU solution to the other CPU solutions. The execution speed is indeed incredible
  • Divakar
    Divakar about 7 years
    @RaduS 1,2,3, Boom!:D Ah GPU is magic! I was learning CUDA awhile back and through your bounty, gave me the motivation to get back to it, so thanks for that! So much to learn huh.
  • Paul Panzer
    Paul Panzer about 7 years
    @Divakar just dropping by to say congrats! I was half thinking of tweaking mine even more but yours is just too good.
  • Divakar
    Divakar about 7 years
    @PaulPanzer Thanks mate! :)
  • Divakar
    Divakar about 7 years
    @RaduS I think I need one more day to write out about constant memory. I also plan to include about texture memory, not a lot, but relevant to the problem at hand. Hope that's okay.
  • RaduS
    RaduS about 7 years
    no problem, i am thankful that you are taking a bit time to elaborate on this matter :)
  • Divakar
    Divakar about 7 years
    @RaduS According to this, constant memory is optimized when warp of threads read same location, which isn't happening with the program flow employed AFAI understood it. Texture memory was worth exploring, but that seems to be a good solution with 2D data. So, considering all that, it seems to me that to use those special memory, we might need to restructure the thought-process/program flow. So, that's all I got for you.
  • RaduS
    RaduS about 7 years
    Hi @Divakar, Thank you the info. I have been working with the code for a while now and i wonder how get lower values on the batch size? When i try with a batch size of 5 or 20, i get misleading results. Somethings i would need a batch size of 5 or 20, but it doesn't work. It seems that the minimum is 1024 based on the THREADS_PER_BLOCK