Misaligned address in CUDA
What the error message means is that the pointer is not aligned to the boundary required by the processor.
From the CUDA Programming Guide, section 5.3.2:
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).
This is what the debugger is trying to tell you: Basically, you shouldn't dereference a pointer pointing to a 32-bit value from an address not aligned at a 32-bit boundary.
You can do (U32*)(sh_MT)
and (U32*)(sh_MT+4)
just fine, but not (U32*)(sh_MT+3)
or such.
You probably have to read the bytes separately and join them together.
Related videos on Youtube
Rezaeimh7
I am always highly thinking of good programmers, researchers, scientists and the other professionals in computer science and want to be all of them at the same time! I love all of them and this extreme love caused me not to be a professional at all in one special aspect and was never able to choose my way in this big world. Although, I don't want to be such anymore, I am doing my best to be more and more extreme Lover.
Updated on September 20, 2022Comments
-
Rezaeimh7 over 1 year
Can anyone tell me whats wrong with the following code inside a CUDA kernel:
__constant__ unsigned char MT[256] = { 0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....}; typedef unsinged int U32; __global__ void Kernel (unsigned int *PT, unsigned int *CT, unsigned int *rk) { long int i; __shared__ unsigned char sh_MT[256]; for (i = 0; i < 64; i += 4) ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i]; __shared__ unsigned int sh_rkey[4]; __shared__ unsigned int sh_state_pl[4]; __shared__ unsigned int sh_state_ct[4]; sh_state_pl[threadIdx.x] = PT[threadIdx.x]; sh_rkey[threadIdx.x] = rk[threadIdx.x]; __syncthreads(); sh_state_ct[threadIdx.x] = ((U32*)sh_MT)[sh_state_pl[threadIdx.x]]^\ ((U32*)(sh_MT+3))[((sh_state_pl[(1 + threadIdx.x) % 4] >> 8) & 0xff)] ^ \ ((U32*)(sh_MT+2))[((sh_state_pl[(2 + threadIdx.x) % 4] >> 16) & 0xff)] ^\ ((U32*)(sh_MT+1))[((sh_state_pl[(3 + threadIdx.x) % 4] >> 24) & 0xff )]; CT[threadIdx.x] = sh_state_ct[threadIdx.x]; }
At This line of code ,
((U32*)(sh_MT+3))......
The CUDA debugger gives me the error message : misaligned address
How can I fix this error?
I am using CUDA 7 in MVSC and i use 1 Block and 4 threads for executing the Kernel Function as follow:
__device__ unsigned int *state; __device__ unsigned int *key; __device__ unsigned int *ct; . . main() { cudaMalloc((void**)&state, 16); cudaMalloc((void**)&ct, 16); cudaMalloc((void**)&key, 16); //cudamemcpy(copy some values to => state , ct, key); Kernel << <1, 4 >> >(state, ct, key); }
Remember please, I can't change my "MT Table" type. Thanks in advance for any advice or answer .
-
CherryDT almost 8 yearsAs the error message tells you, the pointer is not aligned to the boundary required by the processor. Basically, you can't dereference a 32-bit pointer from an address not aligned at a 32-bit boundary. What it means: you can do
(U32*)(sh_MT)
and(U32*)(sh_MT+4)
but not(U32*)(sh_MT+3)
or such. You probably have to read the bytes separately and join them together.
-
-
CygnusX1 almost 8 yearsUm....
sh_MT
points to shared memory, not global, but similar rules may still apply. Also note, that since you are creating achar[256]
in shmem, its starting address does not have to be aligned to anything more than 1 byte. As a result,sh_MT+3
may or may not be actually aligned to 4 bytes and it is entirely compiler/hardware specific. -
CherryDT almost 8 yearsHm you are right about global vs shared... But about the size : the OP converts it to a
U32*
before dereferencing -
CygnusX1 almost 8 yearsLet me rephrase:
sm_MT
is of typeunsigned char[256]
. As such it may be given an address0x4
or0x6
or0x7
. Anything is possible, because it has no alignment requirements. If, say,0x7
is its address, thensh_MT+1
will work fine for dereferencing anU32
type, butsh_MMT+4
will not. I may add - such values are not uncommon! On GTX 200-series, first 6 bytes were used to store blockIdx (asunsigned short
) and 7-th byte (address0x6
) was assigned to first shared variable, as long as alignment permitted. Newer devices may or may not organize differently -
CherryDT almost 8 yearsYes, right, I assumed it was aligned. OK then one first has to figure out the right alignment.