CUDA

CUDA Hardware

How does CUDA actually work under the hood?

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors.

Analogy to CPU

If we compare the CUDA architecture to a CPU architecture:

  • A Streaming Multiprocessor (SM) is like a CPU core (since it contains schedulers, registers, caches, etc.).
  • A CUDA warp is like a CPU’s vectorized execution unit (like AVX, SSE, or ARM Neon).
    • But a warp is a logical grouping of 32 threads, not a fixed piece of hardware.
  • CUDA Cores are like the ALUs within that vector unit (the device-query sketch below shows these counts on a real GPU).
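
To make the analogy concrete, here is a minimal sketch (using the standard CUDA runtime call cudaGetDeviceProperties; device 0 is assumed) that prints these counts for whatever GPU is installed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: print the hardware counts behind the analogy above.
// The exact numbers depend entirely on your GPU.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("SMs (the \"cores\"):              %d\n", prop.multiProcessorCount);
    printf("Warp size (the \"vector width\"): %d threads\n", prop.warpSize);
    printf("Registers per SM (shared pool): %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```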

Big differences that got me confused

  • A CPU core is a physical execution unit; a warp is a logical group of threads.
  • CPU cores have private registers; a warp’s threads share a pool of registers on the SM.
  • GPUs use this design to keep utilization high and hide memory latency: when one warp stalls on memory, the scheduler switches to another resident warp (see the occupancy sketch below).
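
A minimal sketch of that latency-hiding design in numbers: the runtime can report how many blocks of a kernel can be resident on one SM at once. toyKernel and the block size here are illustrative assumptions; the result depends on the kernel's register usage and on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// toyKernel is a hypothetical placeholder; its register and shared-memory
// footprint is what limits how many blocks fit on one SM.
__global__ void toyKernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    int blockSize = 256;   // 256 threads = 8 warps per block
    int blocksPerSM = 0;
    // Standard runtime query: how many blocks of toyKernel can be resident
    // on one SM at once, given its resource usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, toyKernel,
                                                  blockSize, 0);
    printf("Resident blocks per SM: %d (%d warps available to hide latency)\n",
           blocksPerSM, blocksPerSM * blockSize / 32);
    return 0;
}
```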

When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
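
A minimal host-side sketch of such a launch (the kernel name and sizes are illustrative, not from the original text):

```cuda
#include <cuda_runtime.h>

// Each thread computes its global index from its block and thread IDs.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4096 blocks

    // The runtime enumerates these 4096 blocks and hands them to SMs with
    // free capacity; as blocks finish, new ones take their place.
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```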

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a SIMT (Single-Instruction, Multiple-Thread) architecture: threads are grouped into warps of 32 that execute one common instruction at a time.

Unlike CPU cores, warps issue their instructions in order, and there is no branch prediction or speculative execution.
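
A short sketch of what that means for branchy code: when threads of one warp disagree on a branch, the warp executes both paths serially with the inactive threads masked off (the kernel name is illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void divergent(int* out) {
    int lane = threadIdx.x % 32;  // position within the warp
    // Even and odd lanes take different paths. With no branch prediction,
    // the warp simply runs the "if" path with odd lanes masked off, then
    // the "else" path with even lanes masked off, in order.
    if (lane % 2 == 0)
        out[threadIdx.x] = 2 * lane;
    else
        out[threadIdx.x] = 2 * lane + 1;
}

int main() {
    int* d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergent<<<1, 32>>>(d_out);  // launch exactly one warp's worth of threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```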

The NVIDIA GPU architecture uses a little-endian representation.
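
A quick sketch that makes the byte order visible from device code (the kernel name is illustrative; device-side printf is a standard CUDA feature):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Inspect the byte order of a 32-bit value on the device. On a
// little-endian layout, the least significant byte comes first in memory.
__global__ void byteOrder() {
    unsigned int value = 0x01020304u;
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&value);
    // Prints "04 03 02 01" on a little-endian device.
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
}

int main() {
    byteOrder<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```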