CUDA Architecture

Streaming Multiprocessor (SM)

A Streaming Multiprocessor (SM) is a fundamental component of NVIDIA GPUs, consisting of multiple Stream Processors (CUDA cores) responsible for executing instructions in parallel.

Where are CUDA warps inside this diagram?

They’re not physical storage units; it’s like asking where the threads are in the diagram. A warp is simply a group of 32 threads. The threads of a warp execute the instruction stream that comes from writing your CUDA kernel, leveraging a slice of the register file and doing computation on the execution cores.
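
To make this concrete, here’s a minimal sketch (the kernel name is my own) that prints which warp each thread lands in. warpSize is a built-in device variable, 32 on all current NVIDIA GPUs:

```cuda
#include <cstdio>

__global__ void whoAmI() {
    int lane = threadIdx.x % warpSize;  // position within the warp
    int warp = threadIdx.x / warpSize;  // which warp of the block
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
}

int main() {
    whoAmI<<<2, 128>>>();   // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```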

SMs are general-purpose processors designed with a lower clock rate target than a CPU and a small cache.

Task of SM

An SM executes several thread blocks in parallel. As soon as one of its thread blocks completes execution, the SM takes up the next available thread block.
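
As a quick illustration (the kernel here is a toy placeholder of mine): it’s normal to launch far more blocks than the GPU has SMs, since the hardware queues them and hands each SM a new block as a resident one finishes.

```cuda
#include <cstdio>

__global__ void busyKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i);  // trivial per-thread work
}

int main() {
    const int threadsPerBlock = 256;
    const int numBlocks = 4096;  // typically far more than the SM count
    float *out;
    cudaMalloc(&out, numBlocks * threadsPerBlock * sizeof(float));
    busyKernel<<<numBlocks, threadsPerBlock>>>(out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel %s\n", err == cudaSuccess ? "finished" : cudaGetErrorString(err));
    cudaFree(out);
    return 0;
}
```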

From Stephen Jones, I learned that each SM can manage 64 warps, for a total of 64 × 32 = 2048 threads. However, it only actively processes 4 warps at a time (one per warp scheduler). See CUDA Hardware Scheduling.
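
The 64-warp / 2048-thread figure is an architectural maximum and varies by generation. Here’s a minimal sketch (device 0 assumed) to query these limits on your own GPU; the maxBlocksPerMultiProcessor field needs CUDA 11 or newer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("warp size      : %d\n", prop.warpSize);
    printf("warps per SM   : %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    printf("blocks per SM  : %d\n", prop.maxBlocksPerMultiProcessor);
    printf("SM count       : %d\n", prop.multiProcessorCount);
    return 0;
}
```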

Resources

Inside an SM

An SM contains the following:

  1. Execution cores (single-precision floating-point units, double-precision floating-point units, special function units (SFUs)) → called CUDA cores
  2. Caches
    1. L1 cache (for reducing memory access latency)
    2. Shared memory (for data shared between the threads of a block; see the sketch after this list)
    3. Constant cache (for broadcasting reads from read-only memory)
    4. Texture cache (for aggregating bandwidth from texture memory)
  3. Warp schedulers (these issue instructions to warps according to a scheduling policy)
  4. A substantial number of registers (an SM may be running a large number of active threads at a time, so it must have thousands of registers)
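
Here’s a minimal sketch of the shared memory item above: a block-level sum where each block stages its slice of the input in on-chip shared memory and reduces it cooperatively. The kernel name and the 256-thread block size are my own choices, and the input length is assumed to be a multiple of the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];               // lives in the SM's shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                // load from global into shared memory
    __syncthreads();                          // wait until the whole tile is loaded

    // Tree reduction within the block, halving active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];            // one partial sum per block
}

int main() {
    const int n = 1024, bs = 256, nb = n / bs;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, nb * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<<<nb, bs>>>(in, out);
    cudaDeviceSynchronize();
    float total = 0.0f;
    for (int b = 0; b < nb; ++b) total += out[b];
    printf("sum = %.0f (expected %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```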

How many thread blocks at the same time?

An SM may hold up to 8 resident thread blocks at a time on older architectures; newer architectures raise this limit (16 or 32, depending on the generation).
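
On CUDA 11 and newer you can read the exact limit for your GPU from cudaDeviceProp::maxBlocksPerMultiProcessor; the query sketch above already prints it.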

Branch prediction?

In general, SMs support instruction-level parallelism but not branch prediction.
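
A small sketch of the consequence (the kernel is my own toy example): when threads of the same warp take different branches, the SM doesn’t predict a path; it executes both sides serially under an active mask.

```cuda
#include <cstdio>

__global__ void divergent(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2;    // even lanes run while odd lanes are masked off
    else
        out[i] = i * 3;    // then odd lanes run while even lanes are masked off
}

int main() {
    int *out;
    cudaMallocManaged(&out, 32 * sizeof(int));
    divergent<<<1, 32>>>(out);   // a single warp
    cudaDeviceSynchronize();
    printf("out[0]=%d out[1]=%d\n", out[0], out[1]);  // 0 and 3
    cudaFree(out);
    return 0;
}
```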

A GPU consists of many SMs; how many depends on the architecture and the specific chip.