Misaligned address in CUDA

cuda alignment gpu gpgpu nvidia

16,731

What the error message means is that the pointer is not aligned to the boundary required by the processor.

From the CUDA Programming Guide, section 5.3.2:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

This is what the debugger is trying to tell you: Basically, you shouldn't dereference a pointer pointing to a 32-bit value from an address not aligned at a 32-bit boundary.

You can do (U32*)(sh_MT) and (U32*)(sh_MT+4) just fine, but not (U32*)(sh_MT+3) or such.

You probably have to read the bytes separately and join them together.

16,731

Rezaeimh7

I am always highly thinking of good programmers, researchers, scientists and the other professionals in computer science and want to be all of them at the same time! I love all of them and this extreme love caused me not to be a professional at all in one special aspect and was never able to choose my way in this big world. Although, I don't want to be such anymore, I am doing my best to be more and more extreme Lover.

Updated on September 20, 2022

Comments

Rezaeimh7 over 1 year

Can anyone tell me whats wrong with the following code inside a CUDA kernel:

__constant__ unsigned char MT[256] = {
    0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....};

typedef unsinged int U32;

__global__ void Kernel (unsigned int  *PT, unsigned int  *CT, unsigned int  *rk)
{

    long int i;
    __shared__ unsigned char sh_MT[256];    

    for (i = 0; i < 64; i += 4)
        ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i];

    __shared__ unsigned int sh_rkey[4];
    __shared__ unsigned int sh_state_pl[4];
    __shared__ unsigned int sh_state_ct[4];

    sh_state_pl[threadIdx.x] = PT[threadIdx.x];
    sh_rkey[threadIdx.x] = rk[threadIdx.x];
    __syncthreads();


    sh_state_ct[threadIdx.x] = ((U32*)sh_MT)[sh_state_pl[threadIdx.x]]^\
    ((U32*)(sh_MT+3))[((sh_state_pl[(1 + threadIdx.x) % 4] >> 8) & 0xff)] ^ \
    ((U32*)(sh_MT+2))[((sh_state_pl[(2 + threadIdx.x) % 4] >> 16) & 0xff)] ^\
    ((U32*)(sh_MT+1))[((sh_state_pl[(3 + threadIdx.x) % 4] >> 24) & 0xff )];


    CT[threadIdx.x] = sh_state_ct[threadIdx.x];
}

At This line of code ,

((U32*)(sh_MT+3))......

The CUDA debugger gives me the error message : misaligned address

How can I fix this error?

I am using CUDA 7 in MVSC and i use 1 Block and 4 threads for executing the Kernel Function as follow:

__device__ unsigned int *state;
__device__ unsigned int *key;
__device__ unsigned int *ct;
.
.
main()
{
cudaMalloc((void**)&state, 16);
cudaMalloc((void**)&ct, 16);
cudaMalloc((void**)&key, 16);
//cudamemcpy(copy some values to => state , ct, key);   
Kernel << <1, 4 >> >(state, ct, key); 
}

Remember please, I can't change my "MT Table" type. Thanks in advance for any advice or answer .

CherryDT almost 8 years

As the error message tells you, the pointer is not aligned to the boundary required by the processor. Basically, you can't dereference a 32-bit pointer from an address not aligned at a 32-bit boundary. What it means: you can do (U32*)(sh_MT) and (U32*)(sh_MT+4) but not (U32*)(sh_MT+3) or such. You probably have to read the bytes separately and join them together.

CygnusX1 almost 8 years

Um.... sh_MT points to shared memory, not global, but similar rules may still apply. Also note, that since you are creating a char[256] in shmem, its starting address does not have to be aligned to anything more than 1 byte. As a result, sh_MT+3 may or may not be actually aligned to 4 bytes and it is entirely compiler/hardware specific.
CherryDT almost 8 years

Hm you are right about global vs shared... But about the size : the OP converts it to a U32* before dereferencing
CygnusX1 almost 8 years

Let me rephrase: sm_MT is of type unsigned char[256]. As such it may be given an address 0x4 or 0x6 or 0x7. Anything is possible, because it has no alignment requirements. If, say, 0x7 is its address, then sh_MT+1 will work fine for dereferencing an U32 type, but sh_MMT+4 will not. I may add - such values are not uncommon! On GTX 200-series, first 6 bytes were used to store blockIdx (as unsigned short) and 7-th byte (address 0x6) was assigned to first shared variable, as long as alignment permitted. Newer devices may or may not organize differently
CherryDT almost 8 years

Yes, right, I assumed it was aligned. OK then one first has to figure out the right alignment.