CUDA

CUDA Hardware

How does CUDA actually work under the hood?

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors.

When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
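As an illustration (not taken from the original text, names and sizes chosen arbitrarily), a minimal kernel launch might look like the sketch below. The <<<blocksPerGrid, threadsPerBlock>>> launch configuration defines the grid of thread blocks that the hardware then distributes onto whichever multiprocessors have free capacity.

```cuda
// Hypothetical sketch of a kernel grid launch from host code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleArray(float *data, float factor, int n) {
    // Each thread handles one element; blockIdx/blockDim/threadIdx identify it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 256 threads per block; enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Because the blocks are independent, the same launch runs unchanged on a GPU with few multiprocessors or many; only how the blocks are spread out differs.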

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a SIMT (Single-Instruction, Multiple-Thread) architecture, in which threads are executed in groups of 32 called warps.
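A small sketch of how SIMT partitions a block into warps (kernel and array names are hypothetical; it assumes the usual warp size of 32, exposed in device code as the built-in warpSize):

```cuda
// Illustrative sketch: every thread belongs to a warp, and all threads of a
// warp execute the same instruction at the same time.
__global__ void recordWarpIds(int *laneOut, int *warpOut) {
    int lane = threadIdx.x % warpSize;  // position of this thread within its warp
    int warp = threadIdx.x / warpSize;  // which warp of the block the thread is in
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    laneOut[i] = lane;
    warpOut[i] = warp;
}
```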

Unlike CPU cores, a multiprocessor issues instructions in order, and there is no branch prediction or speculative execution.
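One practical consequence is warp divergence: when threads of the same warp take different branches, the paths are executed one after the other, with the threads not on the current path masked off. A hedged sketch (kernel name chosen for illustration):

```cuda
// Illustrative sketch of warp divergence: with no branch prediction, a warp
// whose threads disagree on a branch executes each side in turn.
__global__ void divergentBranch(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[i] = 1.0f;   // executed with the odd lanes masked off
    } else {
        out[i] = 2.0f;   // executed with the even lanes masked off
    }
    // Both sides are issued serially for the warp, so divergence costs time.
}
```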

The NVIDIA GPU architecture uses a little-endian representation.
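A quick way to observe this from device code is to reinterpret an integer's bytes. This sketch is illustrative (it assumes device-side printf is available, which it is on all current GPUs):

```cuda
// Illustrative check of byte order as seen by device code: on a little-endian
// GPU the least significant byte of 0x01020304 is stored first.
#include <cstdio>

__global__ void printByteOrder() {
    unsigned int value = 0x01020304u;
    unsigned char *bytes = reinterpret_cast<unsigned char *>(&value);
    // Expected output on a little-endian device: 04 03 02 01
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
}

int main() {
    printByteOrder<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```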