Memory Alignment

I still have a lot of trouble understanding this.

Also see Linearized Array.

Byte Aligned vs. Word Aligned vs. Memory Aligned

Are these all the same?
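
One way to approach the question above: an address is N-byte aligned when it is a multiple of N. "Byte aligned" (N = 1) is therefore true of every address, "word aligned" requires a multiple of the word size, and "memory aligned" is the general term. The host-side C++ sketch below illustrates this. The helper names and the 4-byte word size are assumptions made for illustration, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a CUDA API): true when `addr` is a multiple of
// `alignment` bytes. `alignment` must be nonzero.
inline bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment != 0);
    return addr % alignment == 0;
}

// Assumed word size for this sketch; the real word size depends on the machine.
constexpr std::size_t kWordBytes = 4;

// Every address is byte aligned (a multiple of 1).
inline bool is_byte_aligned(std::uintptr_t addr) {
    return is_aligned(addr, 1);
}

// Only every kWordBytes-th address is word aligned.
inline bool is_word_aligned(std::uintptr_t addr) {
    return is_aligned(addr, kWordBytes);
}
```

For example, address 0x1003 is byte aligned but not word aligned, while 0x1004 is both.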

There is also DRAM Bursting.


Understanding how memory is aligned is fundamental to making CUDA code run much faster.

Global memory is accessed through 32-, 64-, or 128-byte memory transactions. From the CUDA C Programming Guide:

“When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.”
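
To make the quoted rule more concrete, the host-side C++ sketch below counts how many aligned segments the accesses of one warp fall into. It is a simplified model, not a description of how any particular GPU issues transactions. The 32-thread warp, the segment size parameter, and the function name segments_touched are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumption for this sketch: a warp has 32 threads.
constexpr int kWarpSize = 32;

// Hypothetical model (not an NVIDIA API): counts the distinct aligned segments
// of `segment_bytes` touched when thread t of a warp reads `word_bytes` bytes
// starting at address base + t * stride_bytes.
inline std::size_t segments_touched(std::uintptr_t base,
                                    std::size_t stride_bytes,
                                    std::size_t word_bytes,
                                    std::size_t segment_bytes) {
    std::set<std::uintptr_t> segments;
    for (int t = 0; t < kWarpSize; ++t) {
        std::uintptr_t first = base + t * stride_bytes;  // first byte read
        std::uintptr_t last = first + word_bytes - 1;    // last byte read
        // Record every segment index this thread's access overlaps.
        for (std::uintptr_t s = first / segment_bytes;
             s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

In this model, 32 threads reading consecutive 4-byte words from an address that is a multiple of 128 touch a single 128-byte segment. Shifting the base by 4 bytes makes the same access touch two segments, and a stride of 128 bytes puts every thread in its own segment.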

CUDA Memory Alignment

Also see CUDA Memory.


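cudaMallocPitch guarantees alignment for 2D arrays: it pads each row to a pitch (in bytes) so that every row starts at a suitably aligned address.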
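
The padding idea can be sketched on the host in plain C++. The real pitch is chosen by cudaMallocPitch and returned through its pitch argument. The 512-byte alignment used in the usage note and the helper names padded_pitch and element_offset are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): rounds `width_bytes` up to the next
// multiple of `alignment`. `alignment` must be nonzero.
inline std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment) {
    assert(alignment != 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched allocation of floats. This is
// the same arithmetic as stepping a char* by row * pitch and then indexing
// the resulting float* by col.
inline std::size_t element_offset(std::size_t row, std::size_t col,
                                  std::size_t pitch) {
    return row * pitch + col * sizeof(float);
}
```

With an assumed alignment of 512 bytes, a row of 64 floats (256 bytes) is padded to a pitch of 512 bytes, so row r begins at byte offset r * 512 and every row start is a multiple of 512.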


// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is in bytes, so step through memory as char* to reach row r
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];