Python: rewrite a looping numpy math function to run on GPU

python numpy tensorflow theano numba

12,558

Solution 1

Introduction and solution code

Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule functionality that lets us write and compile CUDA kernels staying in Python environment.

Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.

Listing out the PyCUDA-NumPy solution code -

1] PyCUDA part -

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK

__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
    int tid = threadIdx.x;
    int inv = TBP;
    __shared__ float dS[TBP][2];

    dS[tid][0] = d1[tid+offset];  
    dS[tid][1] = d2[tid+offset];         
    __syncthreads();

    if(tid<L-TBP)  
    {
        dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
        dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)   
    {
        if(tid<inv)
        {
            dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
            dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
        }
        __syncthreads();
        inv = inv/2;
    }
    __syncthreads();

    if(tid==0)
    {
        out[0] = dS[0][0];
        out[1] = dS[0][1];
    }   
    __syncthreads();
}

__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
    int L = BLOCKLEN[0];
    int tid = threadIdx.x;
    int iterID = blockIdx.x;
    float Bmax_Cmin[2];
    int inv;
    float Cmin, dif;   
    __shared__ float dS[TBP*2];   

    get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);  
    Cmin = Bmax_Cmin[1];
    dif = (Bmax_Cmin[0] - Cmin);

    inv = TBP;

    dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
    __syncthreads();

    if(tid<L-TBP)  
        dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);                   

     dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
     __syncthreads();

     if(tid<L-TBP)
         dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
     __syncthreads();

    inv = inv/2;
    while(inv!=0)   
    {
        if(tid<inv)
            dS[tid] += dS[tid+inv];
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
        out[iterID] = dS[0];
    __syncthreads();

}
""")

Please note that THREADS_PER_BLOCK, TBP is to be set based on the batchSize. The rule of thumb here is to assign power of 2 value to TBP that is just lesser than batchSize. Thus, for batchSize = 2000, we needed TBP as 1024.

2] NumPy part -

def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
    func1 = mod.get_function("main1")
    outlen = len(A)-batchSize+1

    # Set block and grid sizes
    BSZ = (1024,1,1)
    GSZ = (outlen,1)

    dest = np.zeros(outlen).astype(np.float32)
    N = np.int32(batchSize)
    func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                     drv.In(data2b), drv.In(data2a),\
                     drv.In(N), block=BSZ, grid=GSZ)
    idx = np.flatnonzero(dest >= minimumLimit)
    return idx, dest[idx]

Benchmarking

I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.

Original approach -

def org_app(data1, batchSize, minimumLimit):
    resultArray = []
    for rowNr in  range(data1.shape[0]-batchSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result]) 
    return resultArray

Timings and verification -

In [2]: #Declare variables
   ...: batchSize = 2000
   ...: sampleSize = 50000
   ...: resultArray = []
   ...: minimumLimit = 490 #use 400 on the real sample data
   ...: 
   ...: #Create Random Sample Data
   ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
   ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: 
   ...: # Make column copies
   ...: A = data1[:,0].copy()
   ...: B = data1[:,1].copy()
   ...: C = data1[:,2].copy()
   ...: D = data1[:,3].copy()
   ...: 
   ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
   ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
   ...: print(np.allclose(gpu_out1, cpu_out1))
   ...: print(np.allclose(gpu_out2, cpu_out2))
   ...: 
True
False

So, there's some differences between CPU and GPU countings. Let's investigate them -

In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))

In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])

In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1.,  1.,  1.])

There are four instances of non-matching counts. These are off at max by 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.

Finally, puting the issue aside, let's get to the most important bit, the performance -

In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop

In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop

In [12]: 2180.0/82.5
Out[12]: 26.424242424242426

Let's try with bigger datasets. With sampleSize = 500000, we get -

In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop

In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop

In [16]: 23200.0/821
Out[16]: 28.25822168087698

So, the speedup stays constant at around 27.

Limitations :

1) We are using float32 numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.

Further improvement :

1) We could use faster constant memory to feed in data2a and data2b, rather than use global memory.

Solution 2

Tweak #1

Its usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible to optimize on the last step of summing.

We could replace the step that makes an array of 1s and 0s and does summing :

np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

with np.count_nonzero that works efficiently to count True values in a boolean array, instead of converting to 1s and 0s -

np.count_nonzero((abcd <= data2a) & (abcd >= data2b))

Runtime test -

In [45]: abcd = np.random.randint(11,99,(10000))

In [46]: data2a = np.random.randint(11,99,(10000))

In [47]: data2b = np.random.randint(11,99,(10000))

In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
10000 loops, best of 3: 81.8 µs per loop

In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
10000 loops, best of 3: 28.8 µs per loop

Tweak #2

Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. Some more info here. Thus, store reciprocal of dif and use that instead at the step :

((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ...

Sample test -

In [52]: A = np.random.rand(10000)

In [53]: dif = 0.5

In [54]: %timeit A/dif
10000 loops, best of 3: 25.8 µs per loop

In [55]: %timeit A*(1.0/dif)
100000 loops, best of 3: 7.94 µs per loop

You have four places using division by dif. So, hopefully this would bring out noticeable boost there too!

Solution 3

Before you start tweaking the target (GPU) or using anything else (i.e. parallel executions ), you might want to consider how to improve the already existing code. You used the numba-tag so I'll use it to improve the code: First we operate on arrays not on matrices:

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

Each time you call doTheMath you expect an integer back, however you use a lot of arrays and create a lot of intermediate arrays:

abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

This creates an intermediate array each step:

tmp1 = A-Cmin,
tmp2 = tmp1 / dif,
tmp3 = B - Cmin,
tmp4 = tmp3 / dif
... you get the gist.

However this is a reduce function (array -> integer) so having a lot of intermediate arrays is unnecessary weight, just calculate the value of the "fly".

import numba as nb

@nb.njit
def doTheMathNumba(tmpData, data2a, data2b):
    Bmax = np.max(tmpData[:, 1])
    Cmin = np.min(tmpData[:, 2])
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    sum_ = 0
    for i in range(tmpData.shape[0]):
        val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

I did something else here to avoid multiple operations:

(((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
= ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
= (A + B + C + D - 4 * Cmin) / (4 * dif)
= (A + B + C + D) / (4 * dif) - (Cmin / dif)

This actually cuts down the execution time by almost a factor of 10 on my computer:

%timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
%timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop

There are certainly also other improvements, for example using a rolling min/max to calculate Bmax and Cmin, that would make at least part of the calculation run in O(sampleSize) instead of O(samplesize * batchsize). This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated to do because the comparisons differ. But definitely possible! See here:

import time
import numpy as np
import numba as nb

@nb.njit
def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    quarter_idiff = 0.25 * idiff
    sum_ = 0
    for i in range(abcd.shape[0]):
        val = abcd[i] * quarter_idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

@nb.njit
def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
    found = 0
    for rowNr in range(data1.shape[0]):
        if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
            result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                    data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
            if (result >= minimumLimit):
                resultArray[found, 0] = rowNr
                resultArray[found, 1] = result
                found += 1
    return resultArray[:found]

#Declare variables
batchSize = 2000
sampleSize = 50000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

from scipy import ndimage
t0 = time.time()
abcd = np.sum(data1, axis=1)
Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))  # correction for even shapes
Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))

result = np.zeros((sampleSize, 2), dtype=np.int64)
doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
print('Runtime:', time.time() - t0)

This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original took had Runtime: 24.68975639343262. Now we're 30 times faster!

With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?

And even if I haven't done this myself, numba says that it's possible to write "Numba for CUDA GPUs" and it doesn't seem to complicated.

Solution 4

Here is some code to demonstrate what is possible by just tweaking the algorithm. It's pure numpy but on the sample data you posted gives a roughly 35x speedup over the original version (~1,000,000 samples in ~2.5sec on my rather modest machine):

>>> result_dict = master('run')
[('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
TOTAL 2.44821691513

Tweaks used:

A+B+C+D, see my other answer
running min/max, including avoiding to compute (A+B+C+D - 4Cmin)/(4dif) multiple times with the same Cmin/dif.

These are more or less routine. That leaves the comparison with data2a/b which is expensive O(NK) where N is the number of samples and K is the size of the window. Here one can take advantage of the relatively well-behaved data. Using the running min/max one can create variants of data2a/b that can be used to test a range of window offsets at a time, if the test fails all these offsets can be ruled out immediately, otherwise the range is bisected.

import numpy as np
import time

# global variables; they will hold the precomputed pre-screening filters
preA, preB = {}, {}
CHUNK_SIZES = None

def sliding_argmax(data, K=2000):
    """compute the argmax of data over a sliding window of width K

    returns:
        indices  -- indices into data
        switches -- window offsets at which the maximum changes
                    (strictly speaking: where the index of the maximum changes)
                    excludes 0 but includes maximum offset (len(data)-K+1)

    see last line of compute_pre_screening_filter for a recipe to convert
    this representation to the vector of maxima
    """
    N = len(data)
    last = np.argmax(data[:K])
    indices = [last]
    while indices[-1] <= N - 1:
        ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
        if len(ge) == 0:
            if last + K >= N:
                break
            last += 1 + np.argmax(data[last + 1 : last + K + 1])
            indices.append(last)
        else:
            last += 1 + ge[0]
            indices.append(last)
    indices = np.array(indices)
    switches = np.where(data[indices[1:]] > data[indices[:-1]],
                        indices[1:] + (1-K), indices[:-1] + 1)
    return indices, np.r_[switches, [len(data)-K+1]]


def compute_pre_screening_filter(bound, n_offs):
    """compute pre-screening filter for point-wise upper bound

    given a K-vector of upper bounds B and K+n_offs-1-vector data
    compute K+n_offs-1-vector filter such that for each index j
    if for any offset 0 <= o < n_offs and index 0 <= i < K such that
    o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j

    therefore the number of data points below filter is an upper bound for
    the maximum number of points below bound in any K-window in data
    """
    pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
    padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
    indices, switches = sliding_argmax(padded, n_offs)
    return padded[indices].repeat(np.diff(np.r_[[0], switches]))


def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
    """compute upper and lower pre-screening filters for data blocks of
    sizes K+n_offs-1 where
    n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk

    the result is stored in global variables preA and preB
    """
    global CHUNK_SIZES

    CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
    preA[1] = upper
    preB[1] = lower
    for n in CHUNK_SIZES:
        preA[n] = compute_pre_screening_filter(upper, n)
        preB[n] = -compute_pre_screening_filter(-lower, n)


def test_bounds(block, counts, threshold=400):
    """test whether the windows fitting in the data block 'block' fall
    within the bounds using pre-screening for efficient bulk rejection

    array 'counts' will be overwritten with the counts of compliant samples
    note that accurate counts will only be returned for above threshold
    windows, because the analysis of bulk rejected windows is short-circuited

    also note that bulk rejection only works for 'well behaved' data and
    for example not on random numbers
    """
    N = len(counts)
    K = len(preA[1])
    r = N % CHUNK_SIZES[0]
    # chop up N into as large as possible chunks with matching pre computed
    # filters
    # start with small and work upwards
    counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                   (block[l:l+K] >= preB[1]))
                  for l in range(r)]

    def bisect(block, counts):
        M = len(counts)
        cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
        if cnts < threshold:
            counts[:] = cnts
            return
        elif M == CHUNK_SIZES[0]:
            counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                          (block[l:l+K] >= preB[1]))
                         for l in range(M)]
        else:
            M //= 2
            bisect(block[:-M], counts[:M])
            bisect(block[M:], counts[M:])

    N = N // CHUNK_SIZES[0]
    for M in CHUNK_SIZES:
        if N % 2:
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M
        elif N == 0:
            return
        N //= 2
    else:
        for j in range(2*N):
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M


def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
            threshold=400):
    samples, upper, lower = data
    N, K = samples.shape[0], upper.shape[0]
    times = [time.time()]
    if use_pre_screening:
        compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
    times.append(time.time())
    # compute switching points of max and min for running normalisation
    upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
    lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
    times.append(time.time())
    # sum columns
    ABCD = samples.sum(axis=-1)
    times.append(time.time())
    counts = np.empty((N-K+1,), dtype=int)
    # main loop
    # loop variables:
    offs = 0
    u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
    l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
    while True:
        # check which is switching next, min(C) or max(B)
        if u_swp > l_swp:
            # greedily take the largest block possible such that dif and Cmin
            # do not change
            block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:l_swp], threshold=threshold)
            else:
                counts[offs:l_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(l_swp - offs)]
            # book keeping
            l_ind += 1
            offs = l_swp
            l_swp = lower_swp[l_ind]
            l_scale = samples[lower_inds[l_ind], 2]
        else:
            block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:u_swp], threshold=threshold)
            else:
                counts[offs:u_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(u_swp - offs)]
            u_ind += 1
            if u_ind == len(upper_inds):
                assert u_swp == N-K+1
                break
            offs = u_swp
            u_swp = upper_swp[u_ind]
            u_scale = samples[upper_inds[u_ind], 1]
    times.append(time.time())
    return {'counts': counts, 'valid': np.where(counts >= 400)[0],
            'timings': np.diff(times)}


def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
           min_chnk=None, dyads=None):
    t = time.time()
    if data in ('fake', 'load'):
        data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
                           usecols=[1,2,3,4])
        data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        if data == 'fake':
            data1 = np.tile(data1, (10, 1))
        threshold = 400
    elif data == 'random':
        data1 = np.random.random((102000, 4))
        data2b = np.random.random(2000)
        data2a = np.random.random(2000)
        threshold = 490
        if use_pre_screening or mode == 'calibrate':
            print('WARNING: pre-screening not efficient on artificial data')
    else:
        raise ValueError("data mode {} not recognised".format(data))
    data = data1, data2a, data2b
    t_load = time.time() - t
    if mode == 'calibrate':
        min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
        dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
        timings = np.zeros((len(min_chnk), len(dyads)))
        print('max bisect  ' + ' '.join([
            '   n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
              end='')
        for i, mc in enumerate(min_chnk):
            print('\nmin chunk {}'.format(mc), end=' ')
            for j, dy in enumerate(dyads):
                for k in range(nrep):
                    if dy == 0: # no pre-screening
                        timings[i, j] += analyse(
                            data, False, mc, dy, threshold)['timings'][3]
                    else:
                        timings[i, j] += analyse(
                            data, True, mc, dy, threshold)['timings'][3]
                timings[i, j] /= nrep
                print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
        best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
                                            timings.shape)
        print('\nbest', min_chnk[best_mc], dyads[best_dy])
        return timings, min_chnk[best_mc], dyads[best_dy]
    if mode == 'run':
        min_chnk = 2 if min_chnk is None else min_chnk
        dyads = 5 if dyads is None else dyads
        res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
        times = np.r_[[t_load], res['timings']]
        print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
        print('TOTAL', times.sum())
        return res

Solution 5

~~This is technically off-topic (not GPU) but I'm sure you'll be interested.~~

There is one obvious and rather large saving:

Precompute A + B + C + D (not in the loop, on the whole data: data1.sum(axis=-1)), because abcd = ((A+B+C+D) - 4Cmin) / (4dif). This will save quite a few ops.

Surprised nobody spotted that one before ;-)

Edit:

There is another thing, though I suspect that's only in your example, not in your real data:

As it stands roughly half of data2a will be smaller than data2b. In these places your conditions on abcd cannot be both True, so you needn't even compute abcd there.

Edit:

One more tweak I used below but forgot to mention: If you compute the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (happens roughly once in windowlength times, and even if it happens, you immediately know the new max) or the old max falls off the window on the left (also happens roughly once in windowlength times). Only in this last case you have to search the entire window for the new max.

Edit:

Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you yourself have to test it for speed. Put "gpu" for "cpu" on the marked line.

On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1 If you do a pip3 install tensorflow-gpu today it ~~should~~ might work.

import numpy as np
import time
import tensorflow as tf

# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations

# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32

def worker(data1, data2a, data2b):
    ###################################
    # CHANGE cpu to gpu in next line! #
    ###################################
    with tf.device('/cpu:0'):
        g = tf.Graph ()
        with g.as_default():
            ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
            B = tf.constant(data1[:, 1], dtype=ddf)
            C = tf.constant(data1[:, 2], dtype=ddf)
            window = tf.constant(len(data2a))
            N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
            data2a = tf.constant(data2a, dtype=ddf)
            data2b = tf.constant(data2b, dtype=ddf)
            def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
                # most of the time we can keep the old max/min
                Bmaxind = tf.cond(Bmaxind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmax(B[i:i+window], axis=0)),
                                  lambda: tf.cond(Bmax>B[i+window-1], 
                                                  lambda: Bmaxind, 
                                                  lambda: i+window-1))
                Cminind = tf.cond(Cminind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmin(C[i:i+window], axis=0)),
                                  lambda: tf.cond(Cmin<C[i+window-1],
                                                  lambda: Cminind,
                                                  lambda: i+window-1))
                Bmax = B[Bmaxind]
                Cmin = C[Cminind]
                abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
                out = out.write(i, tf.to_int32(
                    tf.count_nonzero(tf.logical_and(abcd <= data2a,
                                                    abcd >= data2b))))
                return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
            with tf.Session(graph=g) as sess:
                i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
                    lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
                    (tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
                     tf.Variable(-1, dtype=ddi),
                     tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
                     tf.TensorArray(ddi, size=N)),
                    shape_invariants=None,
                    parallel_iterations=parallel_iterations,
                    back_prop=False)
                out = out.pack()
                sess.run(tf.initialize_all_variables())
                out, = sess.run((out,))
    return out

#Declare variables
batchSize = 2000
sampleSize = 50000#00
resultArray = []

#Create Sample Data
data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
data2a = np.random.uniform(0, 1, (batchSize,))
data2b = np.random.uniform(0, 1, (batchSize,))

t0 = time.time()
out = worker(data1, data2a, data2b)
print('Runtime (tensorflow):', time.time() - t0)


good_indices, = np.where(out >= 490)
res_tf = np.c_[good_indices, out[good_indices]]

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B  = tmpData1[:,1]
    C   = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Loop through the data
t0 = time.time()
for rowNr in  range(sampleSize+1):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    result = doTheMath(tmp_df, data2a, data2b)
    if (result >= 490):
        resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))

View more solutions

12,558

RaduS

Updated on September 14, 2022

Comments

RaduS over 1 year
Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I used a few good days now trying to get my head around it but to no result. I wonder maybe somebody can help me rewrite this function in whatever way you may seem fit as log as I gives the same result at the end. I tried to use @jit from numba but for some reason it is actually much slower than running the code as usual. With a huge sample size, the goal is to decrease the execution time considerably so naturally I believe the GPU is the fastest way to do it.

I'll explain a little what is actually happening. The real data, which looks almost identical as the sample data created in the code below is divided into sample sizes of approx 5.000.000 rows each sample or around 150MB per file. In total there are around 600.000.000 rows or 20GB of data. I must loop through this data, sample by sample and then row by row in each sample, take the last 2000 (or another) rows as of each line and run the doTheMath function which returns a result. That result is then saved back to the hardrive where I can do some other things with it with another program. As you can see below, I do not need all of the results of all the rows, only those bigger than a specific amount. If I run my function as it is right now in python I get about 62seconds per 1.000.000 rows. This is a very long time considering all the data and how fast it should be done with.

I must mention that I upload the real data file by file to the RAM with the help of data = joblib.load(file) so uploading the data is not the problem as it takes only about 0.29 seconds per file. Once uploaded I run the entire code below. What takes the longest time is the doTheMath function. I am willing to give all of my 500 reputation points I have on stackoverflow as a reward for somebody willing to help me rewrite this simple code to run on the GPU. My interest is specifically in the GPU, I really want to see how it is done on this problem at hand.

EDIT/UPDATE 1: Here is a link to a small sample of the real data: data_csv.zip About 102000 rows of real data1 and 2000 rows for real data2a and data2b. Use minimumLimit = 400 on the real sample data

EDIT/UPDATE 2: For those following this post here is a short summary of the answers below. Up until now we have 4 answers to the original solution. The one offered by @Divakar are just tweaks to the original code. Of the two tweaks only the first one is actually applicable to this problem, the second one is a good tweak but does not apply here. Out of the other three answers, two of them are CPU based solutions and one tensorflow-GPU try. The Tensorflow-GPU by Paul Panzer seems to be promising but when i actually run it on the GPU it is slower than the original, so the code still needs improvement.

The other two CPU based solutions are submitted by @PaulPanzer (a pure numpy solution) and @MSeifert (a numba solution). Both solutions give very good results and both process data extremely fast compared to the original code. Of the two the one submitted by Paul Panzer is faster. It processes about 1.000.000 rows in about 3 seconds. The only problem is with smaller batchSizes, this can be overcome by either switching to the numba solution offered by MSeifert, or even the original code after all the tweaks that have been discussed below.

I am very happy and thankful to @PaulPanzer and @MSeifert for the work they did on their answers. Still, since this is a question about a GPU based solution, i am waiting to see if anybody is willing to give it a try on a GPU version and see how much faster the data can be processed on the GPU when compared to the current CPU solutions. If there will be no other answers outperforming @PaulPanzer's pure numpy solution then i'll accept his answer as the right one and gets the bounty :)

EDIT/UPDATE 3: @Divakar has posted a new answer with a solution for the GPU. After my testings on real data, the speed is not even comparable to the CPU counterpart solutions. The GPU processes about 5.000.000 in about 1,5 seconds. This is incredible :) I am very excited about the GPU solution and i thank @Divakar for posting it. As well as i thank @PaulPanzer and @MSeifert for their CPU solutions :) Now my research continues with an incredible speed due to the GPU :)
```
import pandas as pd
import numpy as np
import time

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B = tmpData1[:,1]
    C = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Declare variables
batchSize = 2000
sampleSize = 5000000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

#Create Random Sample Data
data1 = np.matrix(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #upper limit
data2b = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #lower limit
#approx. half of data2a will be smaller than data2b, but that is only in the sample data because it is randomly generated, NOT the real data. The real data2a is always higher than data2b.


#Loop through the data
t0 = time.time()
for rowNr in  range(data1.shape[0]):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    if(tmp_df.shape[0] == batchSize):
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result])
print('Runtime:', time.time() - t0)

#Save data results
resultArray = np.array(resultArray)
print(resultArray[:,1].sum())
resultArray = pd.DataFrame({'index':resultArray[:,0], 'result':resultArray[:,1]})
resultArray.to_csv("Result Array.csv", sep=';')
```
The PC specs I am working on:
```
GTX970(4gb) video card; 
i7-4790K CPU 4.00Ghz; 
16GB RAM;
a SSD drive 
running Windows 7; 
```
As a side question, would a second video card in SLI help on this problem?
- talonmies about 7 years
  
  SLI is irrelevant and has nothing to do with CUDA. As for how can you convert that code -- you do it by sitting down in front of your computer and typing new CUDA kernel code into your computer. And if you want to run it on two GPUs, you also type in API code to manage running the code on two GPUs.
- sascha about 7 years
  
  You can always try numba which can try to automatically use CUDA to some extent. A better approach would be using Theano/Tensorflow's computation-graphs and implement you algorithm within their frameworks to compile it for GPUs. But yeah, in general it's about knowing CUDA and customly design your algorithm for it using the available tools like talonmies mentioned.
- RaduS about 7 years
  
  thank you for the suggestion @sascha. i thought that Theano and Tensorflow is only for machine learning problems. I will see into numba for the moment
- sascha about 7 years
  
  @RaduS No, they are general-purpose tools for math-calculations.
- Divakar about 7 years
  
  One biggest improvement I think would be by using an initialized output array : resultArray and then at each iteration indexing into it to update, rather than starting off with empty list and using slow append.
- RaduS about 7 years
  
  Thank you Divakar for the tweak suggestion
- talonmies about 7 years
  
  There isn't any meaningful CUDA programming question I can see here. Why did you re-add the CUDA tag to this question?
- RaduS about 7 years
  
  removed it, sorry :)
RaduS about 7 years

hi @Divakar, regarding tweak#2, i read the post you linked at and tried to implement it. but it seams that i do not get the same result. maybe i am doing something wrong. Can you have a look at it? maybe it's easier for you to spot the mistake dif = 1.0 /(Bmax - Cmin) and then abcd = ((dif * A) + ((dif * B) + (dif*C) + (dif*D)) / 4)
Divakar about 7 years

@RaduS Well, if Bmax and Cmin are close then, Bmax - Cmin would be a small number and its reciprocal would be a big number. So, later on when arrays are multiplied by that number, we would have different numbers. So, we maybe skip that tweak.
RaduS about 7 years

thank you for the answer Paul. I tested the code on two separate computers both with Windows installed, Python3.5 and tf 0.12.1. For some reason the tensorflow version is slower than the original, even if i activate the GPU, it is still slower than the original. Here are some statistics: Pc1 has no GPU installed: Runtime (tensorflow): 9.321721315383911 Runtime (original): 3.7472479343414307 True Pc2 with GPU installed and activated: Runtime (tensorflow): 72.36511659622192 Runtime (original): 3.5680108070373535 True
RaduS about 7 years

I also get a warning 'WARNING:tensorflow:From C:/testings.py:61 in worker.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Use tf.global_variables_initializer instead.'
RaduS about 7 years

This was just a test on the code you sent with no changes to the data or the sample size. Could it be slower because it is Windows? or because i have python 3.5 and not 3.6? Or is there another reason?
Paul Panzer about 7 years

@RaduS I'm afraid when it comes to tensorflow I'm dabbling myself. Profiling and debugging are a nightmare, as far as I can tell. Let's wait a few days. Perhaps some tf buff will pick up the threads. If not I can have another look. You could try this recipy to get an idea what makes it so slow. Sorry I can't be of more help at this point.
RaduS about 7 years

Thank you @PaulPanzer for giving it a try. Just as a side note, I uploaded a sample data in the question edit, if you want to test on it
Paul Panzer about 7 years

@RaduS Couldn't read them. Could you upload some textfiles (csv's)? Shouldn't be too large if you zip them. Is there any structure in the data, for example are they smooth?
RaduS about 7 years

On the real sample data use the minimum limit of 400 instead for 490 @PaulPanzer
RaduS about 7 years

I really like your edit 1 about abcd = ((A+B+C+D) - 4Cmin) / (4dif) and yes it is true that half of data2a will be smaller than data2b, but that is only in the sample data, not the real data. In the real data data2a is always higher than data2b. These are the upper and lower limits
Paul Panzer about 7 years

"This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated ..." Done just that, will post in a few minutes. It's faast, and I'm using pure numpy.
MSeifert about 7 years

ok, I have to correct my previous statement because I did something wrong, it's only 30 times faster :(
MSeifert about 7 years

@PaulPanzer Yes, one can implement all of these functions again (instead of using scipy filters) but I think you're numpy code is quite convolved and on my computer also slower (not much, but almost 2x slower). So I don't think it's an advantage to "use pure numpy" here. Besides: Numba can actually compile code for GPUs even though I haven't done it myself. :)
RaduS about 7 years

wow that is really impressive results, i like your approach. I see that the res_indices returns a list of all the indexes where it's above the threshold. is How do i get also the result number for each index in the same array after the where?
Paul Panzer about 7 years

Did you use actual data or just random numbers? Makes quite a difference here (x2 - x3). Btw. I'm not religious about numpy or numba or whatever, I was just pointing out that the 30x I'm seeing are mostly down to the improved algo, if you can get more on top of that using numba or whatever, all the better. Oh, and does scipy actually have a sliding argmax or even a sliding max?
Paul Panzer about 7 years

You can use res_indices directly on out out[res_indices] gives you the numbers of points that satisfied your criteria at each offset where this number was 400 or more. Could you test the script on some more of your data? I tweaked it to the sample you posted but I'd be interested to know whether it also works well on other samples.
RaduS about 7 years

I tested now quite a lot your script on the real data and the speed is amazing as well as the accuracy is 100% correct. I get about 3 seconds per 1mil rows. This is really impressive considering that fact that it is running just on the CPU. I am really happy about the results of the script although it's a bit hard for me to understand all of it :)
RaduS about 7 years

What means the ALL_IN = True and MAX_BLUR = 160 variables?
RaduS about 7 years

Hi @MSeifert Thank you for submitting your answer. Now i tested your method on quite a substantial amount of real data. The results are indeed faster, not as fast as i expected from numba, but they are a faster than the original version i had. There is one problem. The accuracy of the results is totally 0%. There has to be something wrong with the calculations, hmm, i will see if i can find the problem tomorrow
RaduS about 7 years

There is one thing i noticed, the more i decrease the window size substantially, the slower it gets which is maybe normal considering the method you are using here
MSeifert about 7 years

@RaduS I accidentally copied the attempt with mode and cval in the filters. I edited the answer and it should be working now. :)
MSeifert about 7 years

@PaulPanzer I don't think that scipy has a sliding max (or argmax) function. But I know pandas.rolling_max. But that's deprecated for numpy-arrays and also much slower than the tweaked scipy-filter.
Paul Panzer about 7 years

You mean the length of data2a/b? Is it really slower or only less faster? How small are the winows? Anyway, I've given the code a facelift. The function and variable names are chosen with more care and there are a few docstrings. ALL_IN and MAX_BLUR have been replaced with use_pre_screening, min_chnk and dyads, which are used to switch the bulk rejection code on or off and to control the chunk sizes of offsets to be pre screened. There is now a master function that supports a calibrate mode that can help you find the best parameters for example if you want to change the window size.
Paul Panzer about 7 years

@MSeifert Thanks. Any chance of your divulging what this mysterious tweaked scipy filter is?
Paul Panzer about 7 years

Thinking about it it kind of makes sense, because one of the major savings is exploiting the fact that the sliding max doesn't change too often. Now, the smaller you make the window, the less true this becomes, so while your savings go away you are still lumbered with the overhead of all that tricksy code. If you are going to very small windows other strategies might be perform better...
RaduS about 7 years

by now i did some quite extensive testings with your script and i really like how it performs and especially i like that it gives 100% accuracy. Occasionally it throws an error "RuntimeWarning: divide by zero encountered in double_scalars * (0.25 / (u_scale-l_scale))" on line block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \ * (0.25 / (u_scale-l_scale)) and also another error RuntimeWarning: invalid value encountered in greater_equal for l in range(r)] But this has happen only a few times so i have to check the data to see maybe there something wrong there
RaduS about 7 years

I will wait a bit more and if nobody else submits an answer for GPU that would perform better i will accept your answer and the points are yours :)
user7138814 about 7 years

scipy has scipy.ndimage.maximum_filter1d
MSeifert about 7 years

@PaulPanzer It's just a sliding minimum (or maximum) that looks size-elements ahead.
MSeifert about 7 years

@user7138814 I already use maximum_filter1d and minimum_filter1d in the solution.
Paul Panzer about 7 years

Ok, the website is yelling at me not to be so talkative, so briefly: do these warnings occur together? The first you describe looks like Cmin and Bmax taking the same value. The second I'm not sure could be something wrong with data2b. What happens if you run the original script over the same data?
Paul Panzer about 7 years

I couldn't help tinkering a bit more. The new code fixes two small bugs and has a new sliding_argmax which on my rig at our standard 1,000,000 million sample test shaves off another half second. So we're down to 2.5sec of which 0.8sec are for loading the data!
user7138814 about 7 years

@MSeifert Oh I didn't see it at the imports there ;-)
RaduS about 7 years

finnally a GPU solution :) thank you @Divakar. I will test it extensively on real data later today and come back with the results
Divakar about 7 years

@RaduS Be sure to check out the edited codes (just edited) for benchmarking! Now it accepts any arbitrary batchSize.
RaduS about 7 years

Hi @Divakar, i did some intensive testings on a lot of real data with the script. And the speed is indeed impressive. But the results are wrong. The output is too big for any given data. The result array should only contain rows which are bigger than the minimumLimit. Here is a link link where i uploaded a sample 102000 rows of real data as well as your script running it. The original script you posted in your solution also gives the wrong result, i guess it is because you change data2a to fit the GPU function but it is not the same input for the CPU function
RaduS about 7 years

In any case in the attached script you can also see the data which should be given to the original CPU script and the data which is given the GPU script. I do not totally understand why the GPU script returns such a big result back, it should't. Also attached in the file you will find what the output of the original CPU script should be based on the real sample data. I would appreciate a lot if you can have a look at the GPU script, maybe you can spot easily what is wrong
RaduS about 7 years

what i meant earlier is that "The result array should only contain a few rows which are bigger than the minimumLimit not so many"
RaduS about 7 years

I also notice that if the sampleSize of data1 is bigger than 4.500.000 rows it crashes the script and the video card. Could it be because of too much data going in? Or maybe because the out result (which is wrong) is too big as well?
RaduS about 7 years

The error is File "C:\toolkits\anaconda3-4.2.0\lib\site-packages\pycuda\driver‌.py", line 411, in function_call handler.post_call(stream) File "C:\toolkits\anaconda3-4.2.0\lib\site-packages\pycuda\driver‌.py", line 96, in post_call memcpy_dtoh(self.array, self.get_device_alloc()) pycuda._driver.Error: cuMemcpyDtoH failed: unknown error
Divakar about 7 years

@RaduS Added Clarification #1 : Issue on value mis-match section on clarifying that value mis match issue. Let me know your thoughts. On bigger data causing crash, I would think its the huge-ness of the data. So, to resolve, we might need to iterate through rather than overload data into GPU. So, we could send data to GPU in chunks. Given that we are using good capable GPUs, we could send in big chunks and thus iterate less and this won't be that bad I think.
RaduS about 7 years

aha, thank you for clarification, i think i found the problem of too big output. Why do you add data2b to data2a? (data2a = data2b + np.random.....) Is it just a typo in your code or is there a reason for that? Because if i just use data2a and data2b without adding b to a, then i get the correct result on the GPU :) with the small max_error of 2 which i am ok with :)
Divakar about 7 years

@RaduS Ah yes! That might be it! Well I was doing that adding part to make the upper limit array data2a always bigger than data2b because I was using randomly generated data. So, yes if you are using CSV data, remove that adding part. Edited that gist post accordingly.
RaduS about 7 years

well in that case i accept your answer as the correct one :) THANK YOU. now i can continue my research faster since i have the script running on the GPU. The last question, you mention something about improvement on the "constant memory" Could you please help me better understand this and elaborate a bit your answer?
Divakar about 7 years

@RaduS Sure, will do that, by tonight I think :)
RaduS about 7 years

you are also the bounty winner :)
Divakar about 7 years

@RaduS Thank you so much! :D
Divakar about 7 years

@RaduS Removed that Clarification #1 : Issue on value mis-match section, as it seems the issue was simply the erroneous adding part :)
RaduS about 7 years

for some reason i think of this video link and it makes me smile when comparing your GPU solution to the other CPU solutions. The execution speed is indeed incredible
Divakar about 7 years

@RaduS 1,2,3, Boom!:D Ah GPU is magic! I was learning CUDA awhile back and through your bounty, gave me the motivation to get back to it, so thanks for that! So much to learn huh.
Paul Panzer about 7 years

@Divakar just dropping by to say congrats! I was half thinking of tweaking mine even more but yours is just too good.
Divakar about 7 years

@PaulPanzer Thanks mate! :)
Divakar about 7 years

@RaduS I think I need one more day to write out about constant memory. I also plan to include about texture memory, not a lot, but relevant to the problem at hand. Hope that's okay.
RaduS about 7 years

no problem, i am thankful that you are taking a bit time to elaborate on this matter :)
Divakar about 7 years

@RaduS According to this, constant memory is optimized when warp of threads read same location, which isn't happening with the program flow employed AFAI understood it. Texture memory was worth exploring, but that seems to be a good solution with 2D data. So, considering all that, it seems to me that to use those special memory, we might need to restructure the thought-process/program flow. So, that's all I got for you.
RaduS about 7 years

Hi @Divakar, Thank you the info. I have been working with the code for a while now and i wonder how get lower values on the batch size? When i try with a batch size of 5 or 20, i get misleading results. Somethings i would need a batch size of 5 or 20, but it doesn't work. It seems that the minimum is 1024 based on the THREADS_PER_BLOCK