CUDA Architecture

Streaming Multiprocessor (SM)

A Streaming Multiprocessor (SM) is a fundamental component of NVIDIA GPUs, consisting of multiple Stream Processors (CUDA Core) responsible for executing instructions in parallel.

They are general purpose processors with a low clock rate target and a small cache.

Consists of:

Task of SM

SMs execute several thread blocks in parallel. As soon as one of its thread block has completed execution, it takes up the serially next thread block.

From Stephen Jones, I learned that each SM can managed 64 warps, so a total of 2048 threads. However, it really processes 4 warps at a time (see Warp Scheduling).

Resources

How many thread blocks at the same time?

An SM may contain up to 8 thread blocks in total.

Branch prediction?

In general, SMs support instruction-level parallelism but not branch prediction.

Each architecture in GPU consists of several SM.