CUDA Hardware
How does CUDA actually work under the hood?
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs).
Analogy to CPU
If we try to compare CUDA architecture to CPU architecture:
- A Streaming Multiprocessor (SM) is like a CPU core (since it contains schedulers, registers, caches, etc.).
- A CUDA warp is like a CPU's vectorized execution unit (like AVX, SSE, or ARM Neon), except that a warp is a logical grouping of threads rather than a physical unit.
- CUDA cores are like the ALUs (lanes) within that vector unit.
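The analogy above can be checked against real hardware by querying the device properties at runtime. This is a minimal sketch using the CUDA runtime API (compile with `nvcc`); it prints the SM ("core") count, the warp width, and the per-SM register pool that warps share:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Each SM is the rough "CPU core" analogue; warpSize is the SIMT width.
    printf("SMs (multiprocessors): %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```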
Big differences that got me confused
- A CPU core is a physical execution unit; a warp is a logical group of 32 threads.
- CPU cores have private registers; a warp's threads share a pool of registers from the SM's register file.
- GPUs use this design to improve utilization and hide latency: when one warp stalls on a memory access, the SM's scheduler simply issues instructions from another ready warp.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
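One way to watch this block-to-SM distribution happen is to have each block report which SM it landed on. This sketch reads the `%smid` PTX special register via inline assembly; the block-to-SM mapping is decided by the hardware scheduler, so the output varies between runs and GPUs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of each block records the id of the SM it is running on.
__global__ void whereAmI(int *smid) {
    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));  // read the SM-id special register
    if (threadIdx.x == 0) smid[blockIdx.x] = sm;
}

int main() {
    const int blocks = 8;
    int *d_smid, h_smid[blocks];
    cudaMalloc(&d_smid, blocks * sizeof(int));

    whereAmI<<<blocks, 32>>>(d_smid);  // 8 blocks of one warp each
    cudaMemcpy(h_smid, d_smid, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %d ran on SM %d\n", b, h_smid[b]);
    cudaFree(d_smid);
    return 0;
}
```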
A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a SIMT (Single-Instruction, Multiple-Thread) architecture: threads are created, managed, scheduled, and executed in warps of 32.
Unlike CPU cores, warp instructions are issued in order, and there is no branch prediction or speculative execution.
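A consequence of SIMT execution is warp divergence: when threads within one warp take different branches, the warp executes each path in turn with the non-participating lanes masked off. A minimal sketch (the kernel and buffer names here are just for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent(int *out) {
    int i = threadIdx.x;
    // Even and odd lanes of the same warp take different branches,
    // so the warp runs both paths serially with inactive lanes masked.
    if (i % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i + 100;
}

int main() {
    const int n = 32;  // exactly one warp
    int *d_out, h_out[n];
    cudaMalloc(&d_out, n * sizeof(int));

    divergent<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%d ", h_out[i]);
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```

Both branch bodies take one trip through the warp's lanes; threads that diverge less keep more lanes busy per instruction.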
The NVIDIA GPU architecture uses a little-endian representation.
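This can be verified from device code, since kernels can call `printf`. A small sketch: store a known 32-bit pattern and inspect which byte sits at the lowest address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void checkEndianness() {
    unsigned int x = 0x01020304;
    unsigned char *p = (unsigned char *)&x;
    // On a little-endian machine the lowest address holds the
    // least-significant byte (0x04).
    printf("first byte on GPU: 0x%02x (%s-endian)\n",
           p[0], p[0] == 0x04 ? "little" : "big");
}

int main() {
    checkEndianness<<<1, 1>>>();
    cudaDeviceSynchronize();  // flush device-side printf output
    return 0;
}
```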