CUDA Architecture

Streaming Multiprocessor (SM)

A Streaming Multiprocessor (SM) is a fundamental component of NVIDIA GPUs. It consists of multiple Streaming Processors (CUDA cores) that execute instructions in parallel.

SMs are general-purpose processors designed for a low clock rate and a small cache.

Task of SM

An SM executes several thread blocks in parallel. As soon as one of its thread blocks has completed execution, the SM takes up the next pending thread block.
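The refill behaviour described above can be sketched as a small simulation. This is only an illustrative model, not how the hardware scheduler is implemented: the function name `simulate_sm`, the per-block durations, and the slot count are all hypothetical, and real GPUs give no guarantee about the order in which blocks are dispatched.

```python
import heapq

def simulate_sm(block_durations, max_resident_blocks):
    """Toy model of one SM: return (block_id, start, end) per block.

    Assumption: blocks are handed out in launch order, and a new block
    starts the moment any resident block finishes.
    """
    running = []   # min-heap of (finish_time, block_id) for resident blocks
    schedule = []
    now = 0
    for block_id, duration in enumerate(block_durations):
        if len(running) == max_resident_blocks:
            # Every slot is busy: wait for the earliest block to finish.
            now, _ = heapq.heappop(running)
        heapq.heappush(running, (now + duration, block_id))
        schedule.append((block_id, now, now + duration))
    return schedule

# Four blocks with made-up durations on an SM that holds two blocks at once.
for block_id, start, end in simulate_sm([5, 3, 4, 2], max_resident_blocks=2):
    print(f"block {block_id}: start={start} end={end}")
```

In this toy run, block 2 starts as soon as block 1 frees its slot, and block 3 starts when block 0 finishes.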

From Stephen Jones, I learned that each SM can manage 64 warps of 32 threads each, so a total of 2048 threads. However, it actually processes only 4 warps at a time.
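The arithmetic behind these figures can be checked with a few lines. This is a minimal sketch: the constant and function names are made up for illustration, and the only number not stated above is the warp width of 32 threads, which is the standard CUDA warp size.

```python
WARP_SIZE = 32              # threads per warp (standard CUDA warp width)
MAX_RESIDENT_WARPS = 64     # warps one SM can manage, as quoted above
WARPS_ISSUED_AT_ONCE = 4    # warps actually processed at a time, as quoted above

def threads_in(warps, warp_size=WARP_SIZE):
    """Number of threads covered by the given number of warps."""
    return warps * warp_size

print(threads_in(MAX_RESIDENT_WARPS))      # threads an SM can manage
print(threads_in(WARPS_ISSUED_AT_ONCE))    # threads processed at any instant
print(MAX_RESIDENT_WARPS // WARPS_ISSUED_AT_ONCE)  # resident warps per processed warp
```

So of the 2048 managed threads, only 128 are being processed at a given moment. The remaining warps are there so the SM has other work to switch to while some warps wait.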

To achieve this purpose, an SM contains the following:

  1. Execution cores (single precision floating-point units, double precision floating-point units, special function units (SFUs)).
  2. Caches
    1. L1 cache (for reducing memory access latency)
    2. Shared memory (for sharing data between the threads of a block)
    3. Constant cache (for broadcasting reads from read-only constant memory)
    4. Texture cache (for aggregating bandwidth from texture memory)
  3. Warp schedulers (these issue instructions to warps according to a particular scheduling policy)
  4. A substantial number of registers (an SM may be running a large number of active threads at a time, so it needs thousands of registers)
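To see why the register file has to be so large, here is a rough budget calculation. It is a sketch under stated assumptions: the figure of 65,536 registers per SM is an assumed value (typical of many NVIDIA architectures but not taken from these notes), the helper name is hypothetical, and real hardware allocates registers at a coarser granularity than modelled here.

```python
REGISTERS_PER_SM = 65536    # ASSUMPTION: 64K 32-bit registers per SM
MAX_THREADS_PER_SM = 2048   # 64 warps x 32 threads
WARP_SIZE = 32

def max_resident_threads(regs_per_thread):
    """Threads that fit on one SM if each thread needs regs_per_thread registers.

    Simplified model: the count is rounded down to whole warps and capped
    at the SM's thread limit.
    """
    by_registers = REGISTERS_PER_SM // regs_per_thread
    whole_warps = (by_registers // WARP_SIZE) * WARP_SIZE
    return min(MAX_THREADS_PER_SM, whole_warps)

for regs in (32, 64, 128):
    print(regs, "registers/thread ->", max_resident_threads(regs), "threads")
```

Under these assumptions, a kernel can use at most 32 registers per thread if all 2048 threads are to be resident. Doubling the register use per thread halves the number of threads that fit.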

How many thread blocks at the same time?

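An SM can hold up to 8 resident thread blocks at a time. The exact limit depends on the GPU's compute capability: it is 8 on older architectures and higher on newer ones.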
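The block limit interacts with the 64-warp limit mentioned earlier: whichever is reached first decides how many blocks are resident. The sketch below is a simplified model with hypothetical helper names. It ignores register and shared memory limits, which can reduce the count further.

```python
MAX_BLOCKS_PER_SM = 8    # block limit quoted above
MAX_WARPS_PER_SM = 64    # warp limit quoted earlier in these notes
WARP_SIZE = 32

def warps_per_block(threads_per_block):
    """Warps needed for one block (a partly filled warp still counts as one)."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division

def resident_blocks(threads_per_block):
    """Blocks of this size that fit on one SM under both limits."""
    by_warps = MAX_WARPS_PER_SM // warps_per_block(threads_per_block)
    return min(MAX_BLOCKS_PER_SM, by_warps)

def occupancy(threads_per_block):
    """Fraction of the SM's warp slots that are in use."""
    active = resident_blocks(threads_per_block) * warps_per_block(threads_per_block)
    return active / MAX_WARPS_PER_SM

for size in (64, 256, 1024):
    print(size, resident_blocks(size), occupancy(size))
```

With these limits, blocks of 64 threads hit the 8-block cap and fill only a quarter of the warp slots, while blocks of 256 threads reach both limits at once and fill the SM completely.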

Branch prediction?

In general, SMs support instruction-level parallelism but not branch prediction.

Links

Each GPU architecture consists of several SMs.