CUDA Memory

CUDA Shared Memory

Learned about this through the PMPP book.

CUDA Shared Variables

Shared variables in CUDA are stored in shared memory.

This is fundamental to understand if you want to make kernels run fast. The classic motivating example is tiled matrix multiplication, where staging tiles of the inputs in shared memory cuts global-memory traffic.

Resources

Limit on Shared Memory

CUDA shared memory is limited per thread block; the default static limit is 48 KB. Newer architectures let a kernel opt in to more dynamic shared memory via cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize.
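A quick way to see the actual limits on your own GPU is to query the runtime. A minimal sketch (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int perBlock = 0, perBlockOptin = 0, perSM = 0;
    // Default static shared-memory limit per block (typically 48 KB).
    cudaDeviceGetAttribute(&perBlock, cudaDevAttrMaxSharedMemoryPerBlock, 0);
    // Larger opt-in limit that newer architectures expose for dynamic shared memory.
    cudaDeviceGetAttribute(&perBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    // Total shared memory available on each SM (shared among resident blocks).
    cudaDeviceGetAttribute(&perSM, cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    printf("per block: %d B, opt-in: %d B, per SM: %d B\n",
           perBlock, perBlockOptin, perSM);
    return 0;
}
```

Note the per-SM number matters for occupancy: if each block uses a large slice of the SM's shared memory, fewer blocks can be resident at once.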

If a variable declaration is preceded by the __shared__ keyword, it declares a shared variable in CUDA.

Shared variables reside in shared memory.

Shared Variable Scope

The scope of a shared variable is within a thread block.
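A short sketch of what block scope means in practice: every block gets its own private copy of the shared array, and __syncthreads() is what makes writes visible across the block. This is an illustrative block-wise reduction (names hypothetical; assumes blockDim.x == 256 and the input length is a multiple of 256):

```cuda
__global__ void block_sum(const float *in, float *out) {
    __shared__ float tile[256];           // one copy per block, shared by its threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                      // all writes visible block-wide before reading

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                  // finish each level before the next
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];        // one partial sum per block
}
```

Blocks cannot read each other's copy of tile; combining the per-block partial sums requires a second kernel launch or atomics on global memory.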

Shared memory can be thought of as a CPU's L1 cache that the programmer manages explicitly: a scratchpad where you decide what to stage, rather than the hardware deciding what to cache.

How is shared memory implemented?

Shared memory is SRAM on each SM, used as an explicit scratchpad. The same physical SRAM also serves as the L1 cache, with the split configurable per kernel (MIT 6.S894 Lec 4, slides 17-24). On the lecture's reference GPU: 128 KB per SM, 128 bytes/cycle per SM, one warp-wide load/store every 4 cycles, ~9.6 TB/s aggregate across 48 SMs. Each block sees only its own allocation, which is why the thread block is the unit of threads that can share data through it.
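The scratchpad idea is exactly what the tiled matrix-multiplication example from PMPP exploits: each block stages a tile of A and a tile of B in its scratchpad, so each global element is loaded once per tile instead of once per multiply. A sketch (TILE size and kernel name are illustrative choices):

```cuda
#define TILE 16

// Tiled C = A * B for square N x N row-major matrices.
// Each block computes one TILE x TILE tile of C.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];      // per-block staging tiles
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative load: each thread brings in one element of each tile,
        // padding with zeros past the matrix edge.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                  // tiles fully loaded before anyone reads

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // done reading before next iteration overwrites
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Without tiling, each thread would read 2N elements from global memory; with tiling, global traffic per element drops by a factor of TILE, which is the whole point of the scratchpad.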