Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks


Solution 1

If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.

In other words if you have:

A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18

And you access this using:

float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];

Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
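To make that concrete, here is a minimal kernel sketch built around that indexing expression. The kernel name, NPAIRS, and the pairindex/matindex parameters are hypothetical placeholders, not something taken from your code:

#define NPAIRS 8                            // assumed number of matrix pairs

__global__ void readPair(const float *g_in, float *g_out,
                         int pairindex, int matindex)
{
    __shared__ float A[34 * NPAIRS];

    // Linear staging copy: thread i writes word i, then i + blockDim.x, ...
    // so consecutive threads always touch consecutive banks.
    for (int i = threadIdx.x; i < 34 * NPAIRS; i += blockDim.x)
        A[i] = g_in[i];
    __syncthreads();

    // Adjacent threads read adjacent 32-bit words -> adjacent banks -> no conflict.
    float element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
    g_out[threadIdx.x] = element;
}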

In response to your comments (below), it does seem that your understanding is mistaken. It is true that there are 16 banks (on the current generation; 32 on the next generation, Fermi), but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that as long as your array index can always be decomposed as x + threadIdx.x (where x does not depend on threadIdx.x, or at least is constant across each group of 16 threads), you will not have bank conflicts.
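As an illustration of that rule (a sketch only; s_data, row and the kernel name are hypothetical, and it assumes at most 256 threads per block), the first two reads below decompose to a uniform base plus threadIdx.x and are conflict-free, while the last multiplies threadIdx.x by the bank count and serialises:

__global__ void indexPatterns(float *g_out, int row)    // assumes 0 <= row < 16
{
    __shared__ float s_data[512];

    // Fill the array linearly (conflict-free in itself).
    for (int i = threadIdx.x; i < 512; i += blockDim.x)
        s_data[i] = (float)i;
    __syncthreads();

    float ok1 = s_data[threadIdx.x];              // stride 1: consecutive banks
    float ok2 = s_data[row * 16 + threadIdx.x];   // row is uniform, still base + threadIdx.x
    float bad = s_data[threadIdx.x * 16 % 512];   // stride 16: the whole half-warp hits bank 0
    g_out[threadIdx.x] = ok1 + ok2 + bad;
}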

When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.

The reduction sample in the SDK illustrates bank conflicts very well by building up from a naive implementation to an optimised one; it is worth taking a look.
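For reference, the essence of what that sample shows, heavily simplified and not its actual code, is the difference between strided and sequential addressing during the reduction:

__global__ void reduceSum(const float *g_in, float *g_out)
{
    __shared__ float sdata[256];                 // assumes blockDim.x == 256
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Sequential addressing: active threads touch contiguous words, so
    // consecutive threads fall in consecutive banks and never conflict.
    // (The naive strided form, sdata[2*s*tid] += sdata[2*s*tid + s],
    // makes several threads of a half-warp hit the same bank.)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}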

Solution 2

Banks are set up such that each successive 32-bit word sits in the next bank. So, if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.

If you have arrays of size 18 and 16, things can get awkward. You can avoid bank conflicts in the 16x16 array by declaring it like

__shared__ float sixteen[16][16+1];

which avoids bank conflicts when accessing transposed elements via threadIdx.x (as I assume you're doing if you're getting conflicts). Without the padding, the elements of, say, the first column of a 16x16 matrix all reside in the same bank; what you want is for each of them to fall in a successive bank, and the padding does exactly that. You treat the array exactly as you would before, as sixteen[row][column], or, for a flattened matrix, as sixteen[row*(16+1)+column] if you prefer.
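As a hedged sketch of that padding in use (the kernel name and layout are mine, not yours; assume a 16x16 thread block), a tile written row-wise and read back column-wise stays conflict-free because the padded row width of 17 makes the column stride 17 % 16 = 1:

__global__ void transposeTile(const float *g_in, float *g_out)
{
    __shared__ float tile[16][16 + 1];           // +1 column of padding

    int x = threadIdx.x;                         // 0..15
    int y = threadIdx.y;                         // 0..15

    // Row-wise write: consecutive threads in x hit consecutive banks.
    tile[y][x] = g_in[y * 16 + x];
    __syncthreads();

    // Column-wise (transposed) read: the stride between rows is 17 words,
    // and 17 % 16 == 1, so a half-warp still lands in 16 different banks.
    g_out[y * 16 + x] = tile[x][y];
}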

For the 18x18 case, when accessing in the transpose you're moving with an even stride (18 words), so pairs of threads collide in the same bank. The answer again is to pad by 1:

__shared__ float eighteens[18][18+1];

So now, when you access in the transpose (say, reading the elements of the first column), the effective stride is 18+1 = 19 words and 19 % 16 = 3, so successive threads hit banks 0, 3, 6, 9, 12, 15, 2, 5, 8 and so on. Since 3 and 16 share no common factor, all 16 threads of a half-warp land in different banks and you get no conflicts.
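You can check that claim with a few lines on the host (a throwaway sketch, nothing to do with your kernel): with a padded row width of 19 floats, the 16 rows of a column map to 16 distinct banks precisely because gcd(3, 16) = 1.

#include <stdio.h>

int main(void)
{
    const int width = 18 + 1;                    // padded row width in floats
    for (int row = 0; row < 16; ++row)           // one half-warp's worth of rows
        printf("row %2d of column 0 -> bank %d\n", row, (row * width) % 16);
    return 0;
}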

The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference; it's only the order in which you access it that matters. If you want to flatten the arrays I've proposed above and merge them into one, that's fine, as long as you access them in a similar fashion.


Comments

  • zenna

    As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast).

    At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:

    __shared__ float A[34*N];
    

    Where N is the number of pairs, the first 16 floats of a pair are one matrix, and the following 18 floats are the second.

    The thing is, access to the first matrix is conflict free, but access to the second one has conflicts. Those conflicts are unavoidable; my thinking, however, is that because the second matrix is 18 floats long, all subsequent matrices will be misaligned with the banks, so more conflicts than necessary will occur.

    Is this true, if so how can I avoid it?

    Every time I allocate shared memory, does it start at a new bank? So could I potentially do

    __shared__ float Apair1[34];
    __shared__ float Apair2[34];
    ...
    

    Any ideas?

    Thanks