Memory Alignment
https://stackoverflow.com/questions/1063809/aligned-and-unaligned-memory-accesses
I still have a lot of trouble understanding this.
Also see Linearized Array.
Byte Aligned vs. Word Aligned vs. Memory Aligned
Are these all the same?
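As a rough answer to the question above: as the terms are commonly used, they are not quite the same. "Byte aligned" usually means an address may be any multiple of 1 byte, "word aligned" means it is a multiple of the machine word size (assumed to be 4 bytes in the sketch below), and "memory aligned" is a general term for an address being a multiple of some required alignment. The helper names isAligned and alignUp are made up for these notes and are not from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `alignment` is a power of two (1, 2, 4, 8, ...).
bool isPowerOfTwo(std::size_t alignment) {
    return alignment != 0 && (alignment & (alignment - 1)) == 0;
}

// True when `address` is a multiple of `alignment` bytes.
// Hypothetical helper; assumes `alignment` is a power of two.
bool isAligned(std::uintptr_t address, std::size_t alignment) {
    assert(isPowerOfTwo(alignment));
    return (address & (alignment - 1)) == 0;
}

// Rounds `address` up to the next multiple of `alignment` bytes.
// Hypothetical helper; assumes `alignment` is a power of two.
std::uintptr_t alignUp(std::uintptr_t address, std::size_t alignment) {
    assert(isPowerOfTwo(alignment));
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}
```

Under these assumptions, address 0x1002 is byte aligned (alignment 1) but not word aligned (alignment 4), and alignUp(0x1001, 16) gives the next 16-byte boundary.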
There is also DRAM Bursting.
CUDA
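Understanding memory alignment is fundamental to making CUDA kernels run faster, because the way a warp's addresses line up in global memory determines how many memory transactions each access costs.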
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
The guide explains that device memory is accessed via 32-, 64-, or 128-byte memory transactions, which must be naturally aligned. It then says:
“When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.”
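To make the quote more concrete, here is a small host-side sketch that counts how many aligned segments a warp's accesses would touch. It is a deliberately simplified model, not how any particular GPU behaves: real transaction sizes and caching depend on the compute capability. The function name countSegments and all of its parameters are made up for these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct aligned segments of `segmentBytes` bytes touched when
// thread i (0 <= i < threads) reads `wordBytes` bytes starting at
// base + i * strideBytes. Simplified model: one transaction per segment.
std::size_t countSegments(std::uint64_t base, std::size_t strideBytes,
                          std::size_t wordBytes, std::size_t threads,
                          std::size_t segmentBytes) {
    assert(segmentBytes != 0 && wordBytes != 0);
    std::set<std::uint64_t> segments;
    for (std::size_t i = 0; i < threads; ++i) {
        const std::uint64_t first = base + i * strideBytes;
        const std::uint64_t last = first + wordBytes - 1;
        // A single access can straddle a segment boundary, so record every
        // segment between its first and last byte.
        for (std::uint64_t s = first / segmentBytes; s <= last / segmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

In this model, 32 threads reading consecutive 4-byte words from an address that is a multiple of 128 touch a single 128-byte segment. Shifting the base address by 4 bytes makes the same access span two segments, and a stride of 128 bytes puts every thread in its own segment.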
CUDA Memory Alignment
Also see CUDA Memory.
cudaMallocPitch
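Allocates a 2D array with each row padded so that every row starts at an aligned address. The padded row width in bytes (the pitch) is returned through the pitch parameter and must be used to compute row addresses.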
Source: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;  // row width in bytes, including padding; set by cudaMallocPitch
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is in bytes, so step between rows with char* arithmetic
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
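The row arithmetic in the kernel above can be checked on the host with a small sketch. The actual pitch is chosen by the CUDA runtime and depends on the device, so the 512-byte row alignment used here is only an assumption for illustration. The names kAssumedRowAlignment, pitchFor and elementOffset are made up for these notes.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: hypothetical row alignment in bytes. The real value is decided
// by cudaMallocPitch and should always be read from its pitch output.
constexpr std::size_t kAssumedRowAlignment = 512;

// Rounds a row width in bytes up to the next multiple of `alignment`.
std::size_t pitchFor(std::size_t widthBytes, std::size_t alignment) {
    assert(alignment != 0);
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched allocation, mirroring
// (char*)devPtr + row * pitch followed by indexing with col.
std::size_t elementOffset(std::size_t row, std::size_t col,
                          std::size_t pitch, std::size_t elemBytes) {
    return row * pitch + col * elemBytes;
}
```

With a 64-float row (256 bytes) and the assumed 512-byte alignment, each row would occupy 512 bytes, so element (2, 3) would sit 2 * 512 + 3 * 4 = 1036 bytes from the start of the allocation.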