CUDA Hardware
How does CUDA actually work under the hood?
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs).
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
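The launch behaviour described above can be sketched in a few lines of CUDA. This is a minimal illustration, not a reference implementation: the names recordBlockIndex, launchGrid, smCount and residentBlocksPerSM are hypothetical helpers introduced here, and device 0 is assumed to be the GPU of interest. Only standard CUDA runtime calls are used.

```cuda
#include <cuda_runtime.h>

// Each thread records the index of the block it belongs to.
__global__ void recordBlockIndex(int *blockOfThread, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        blockOfThread[i] = blockIdx.x;
    }
}

// Hypothetical helper: launches a 1D grid covering n elements and copies the
// per-thread block indices back to hostOut. Returns the number of blocks in
// the grid, or -1 if any runtime call fails.
int launchGrid(int n, int threadsPerBlock, int *hostOut)
{
    int *devOut = nullptr;
    if (cudaMalloc(&devOut, n * sizeof(int)) != cudaSuccess) return -1;

    // Round up so that every element is covered by some block.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // The hardware decides which multiprocessor runs which block, and when.
    recordBlockIndex<<<blocks, threadsPerBlock>>>(devOut, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(devOut);
    return err == cudaSuccess ? blocks : -1;
}

// Hypothetical helper: number of multiprocessors on device 0, or -1 on error.
int smCount()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

// Hypothetical helper: how many blocks of this kernel can be resident on one
// multiprocessor at the same time for the given block size, or -1 on error.
int residentBlocksPerSM(int threadsPerBlock)
{
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, recordBlockIndex, threadsPerBlock, 0);
    return err == cudaSuccess ? numBlocks : -1;
}
```

Calling launchGrid(1000, 256, out) creates a grid of 4 blocks. When the grid has more blocks than smCount() * residentBlocksPerSM(256), the remaining blocks wait until earlier ones terminate. The code never states which multiprocessor runs a given block, which is what lets the same program scale across GPUs with different multiprocessor counts.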
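A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a SIMT (Single-Instruction, Multiple-Thread) architecture, in which threads are created, scheduled, and executed in groups of 32 called warps.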
Unlike on CPU cores, instructions are issued in order, and there is no branch prediction or speculative execution.
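One practical consequence of SIMT execution is branch divergence. Threads are executed in groups of 32 (warps) that share one instruction stream, so when threads of the same warp take different sides of a branch, the warp runs each side in turn with the non-participating threads disabled. The sketch below contrasts the two cases. The kernel names and the runBranchKernel helper are hypothetical, and for bodies this small the compiler may well emit predicated instructions instead of a real branch, so treat it as an illustration of the pattern only.

```cuda
#include <cuda_runtime.h>

// Divergent: neighbouring threads of the same warp disagree on the condition,
// so the warp has to execute both paths one after the other.
__global__ void divergentBranch(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2 * i;
    } else {
        out[i] = 2 * i + 1;
    }
}

// Uniform: the condition depends only on the warp index within the block,
// so all threads of a warp agree and each warp executes a single path.
__global__ void uniformBranch(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) {
        out[i] = 2 * i;
    } else {
        out[i] = 2 * i + 1;
    }
}

// Hypothetical helper: runs one of the kernels with 64 threads per block and
// copies the result to hostOut. Returns true if every runtime call succeeds.
bool runBranchKernel(bool divergent, int *hostOut, int n)
{
    int *devOut = nullptr;
    if (cudaMalloc(&devOut, n * sizeof(int)) != cudaSuccess) return false;

    const int threadsPerBlock = 64;  // two warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (divergent) {
        divergentBranch<<<blocks, threadsPerBlock>>>(devOut, n);
    } else {
        uniformBranch<<<blocks, threadsPerBlock>>>(devOut, n);
    }
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(devOut);
    return err == cudaSuccess;
}
```

Both kernels are correct. The difference is only in how much of each warp does useful work per issued instruction, which is why branch conditions aligned to warp boundaries are generally preferred where the algorithm allows it.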
The NVIDIA GPU architecture uses a little-endian representation.
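Little-endian means the least significant byte of a multi-byte value sits at the lowest address. A small sketch can make this visible by letting the device copy a 32-bit value out byte by byte. The names storeBytes and deviceBytes are hypothetical helpers introduced for this illustration.

```cuda
#include <cuda_runtime.h>

// Copies the four bytes of a 32-bit value to bytes[0..3] in the order in
// which the device stores them in memory.
__global__ void storeBytes(unsigned int value, unsigned char *bytes)
{
    unsigned int local = value;  // local copy whose address we can take
    const unsigned char *p = reinterpret_cast<const unsigned char *>(&local);
    for (int k = 0; k < 4; ++k) {
        bytes[k] = p[k];
    }
}

// Hypothetical helper: runs storeBytes with a single thread and copies the
// four bytes back to the host. Returns true if every runtime call succeeds.
bool deviceBytes(unsigned int value, unsigned char out[4])
{
    unsigned char *devBytes = nullptr;
    if (cudaMalloc(&devBytes, 4) != cudaSuccess) return false;

    storeBytes<<<1, 1>>>(value, devBytes);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(out, devBytes, 4, cudaMemcpyDeviceToHost);
    }
    cudaFree(devBytes);
    return err == cudaSuccess;
}
```

For the value 0x01020304, the byte at index 0 is 0x04 and the byte at index 3 is 0x01. Common host CPUs such as x86-64 use the same byte order, so plain cudaMemcpy transfers of multi-byte data need no byte swapping on those hosts.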