CUDA

CUDA Architecture

These are fundamental:

Hardware Perspective

SM ↔ Block / Warp ↔ Thread

SMs manage up to 2048 resident threads (or 64 warps of 32 threads, 32 * 64 = 2048 threads); the exact limit varies by compute capability.

Warp

A warp is a logical grouping of 32 threads of execution. It is effectively the GPU's vector unit.

You want the number of threads in a block to be a multiple of 32 so that no warp is scheduled partially filled. A block cannot be split across 2 different SMs.

  • It is fundamental that you understand the resource constraints

Warps contain consecutive threads!

Each warp consists of 32 threads with consecutive threadIdx values: threads 0-31 form the 1st warp, 32-63 the 2nd warp, etc.
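As a quick illustration (a made-up sketch, not from any particular source), each thread can derive its warp and lane within the block from threadIdx:

```cpp
// Sketch: how consecutive threadIdx values map to warps and lanes.
// Kernel name and output buffers are made up for illustration.
__global__ void warp_layout(int *warp_of, int *lane_of) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / 32;   // which warp of the block this thread is in
    int lane = threadIdx.x % 32;   // position (0-31) within that warp
    warp_of[tid] = warp;
    lane_of[tid] = lane;
}
```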

Software Perspective

Grid Block Thread

Maximum number of threads in block?

The maximum number of threads in a block is limited to 1024. Source

Maximum number of blocks in grid?

For blocks, the limit is 65535 in the y and z dimensions of the grid; in the x dimension, modern GPUs (compute capability 3.0+) allow up to 2^31 - 1 blocks. Source
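To make these limits concrete, here is a minimal sketch of a launch configuration that respects them (the kernel `scale` and the size `N` are made up for the example): at most 1024 threads per block, and a grid sized from the data.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: scales n floats by a.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // last block may be partially used
        x[i] = a * x[i];
}

int main() {
    const int N = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, N * sizeof(float));

    int threadsPerBlock = 256;                                         // multiple of 32, <= 1024
    int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock; // here: 4096 blocks
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_x, N, 2.0f);

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```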

Linking the two together

If you understand the hardware, you will understand why the software model is laid out the way it is.

Characteristics:

  • Blocks are scheduled to an SM
  • All threads within a particular block must reside on a single SM
  • SMs can handle multiple thread blocks at the same time
  • A thread ID is assigned to a thread by its respective SM

True or False?

True or False: All the threads inside the thread block are executed at the same time.

False. Threads are scheduled in groups of 32 to a CUDA Warp. If a block has more than 32 threads, there is no guarantee that all threads of the block will run at the same time.

Each SM has 2-4 warp schedulers, so 2-4 warps can be issued in a given cycle (see Warp Scheduling). This means some warps of a block may indeed execute at the same time (if each warp sits on a different scheduler). If all the warps of the same thread block are assigned to the same warp scheduler on a given SM, they are issued on successive cycles, so only after a few cycles (e.g., after the 4th cycle for 4 warps) are they all in flight concurrently.

To free the resources (e.g., shared memory) that a thread block holds on the SM, the entire set of threads in the block must have concluded execution.

Block → Warp Conversion

Each thread block is physically broken down into warps (chunks of 32 consecutive threads).

A warp does not contain blocks; it is the other way around (a block contains 1 or more warps).

Instructions are issued at the warp level (32 consecutive threads at a time). This execution model is referred to as SIMT Architecture.

What happens if the number of threads per block is < 32?

Since there are fewer than 32 threads, all of them fit into a single warp. The unused lanes of the warp are masked off and the code runs at lower than peak efficiency. You can think of it as if the warp were divergent for the entire program. Source

Each warp is executed in a SIMD fashion (all threads within a warp execute the same instruction at any given time).
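An illustrative (made-up) example of what this means for branches: when threads of the same warp take different paths, the warp runs both paths one after the other, masking off the lanes that did not take the current path.

```cpp
// Sketch of warp divergence: even and odd lanes take different branches,
// so the warp issues both branches serially with the inactive lanes masked.
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2 * i;   // even lanes execute this while odd lanes are masked
    } else {
        out[i] = 3 * i;   // then odd lanes execute this while even lanes are masked
    }
}
```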

One SM per block!

A block runs on a single SM. It can never span 2 different SMs.

  • Why? Because all of the block's threads use the same shared memory (which lives on the SM), and because they need to synchronize with each other via __syncthreads() (see the sketch below)
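A sketch of both points (the kernel is made up for these notes): __shared__ memory lives on the SM that hosts the block, and __syncthreads() only synchronizes the threads of that one block.

```cpp
// Block-level sum: shared memory and __syncthreads() both have block scope,
// which is only possible because the whole block sits on one SM.
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float tile[256];                  // on-chip, per-block storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             // all threads of THIS block wait here

    // Tree reduction within the block (assumes blockDim.x == 256, a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = tile[0];
}
```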

Confusion

I was confused by the point of blocks: don't we only really need 1 block and a bunch of threads?

  • Yea, so in this scenario, your single block would live on one SM and the rest of the GPU's SMs would sit idle. Now suppose you could let your block be split across a bunch of SMs:
  • it would be hard to synchronize threads, because the SMs would need to talk to each other
  • shared memory access would also be a problem, since shared memory is physically local to a single SM

The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number of processors in the system.

Better Performance

This is fundamental to getting better performance:

  • Choose a number of blocks that is a multiple of the number of SMs (because blocks are spread out over the SMs)
  • Choose a block size that is a multiple of 32 (because each warp consists of 32 threads); see the sketch after this list
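As a sketch of how such a configuration might be derived (this uses the real CUDA runtime call cudaGetDeviceProperties; the factor of 4 blocks per SM and the commented-out kernel are just placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                    // properties of device 0

    int threadsPerBlock = 256;                            // multiple of 32
    int blocksPerGrid   = prop.multiProcessorCount * 4;   // multiple of the SM count
    printf("%d SMs -> launching %d blocks of %d threads\n",
           prop.multiProcessorCount, blocksPerGrid, threadsPerBlock);

    // my_kernel<<<blocksPerGrid, threadsPerBlock>>>(...);  // hypothetical kernel
    return 0;
}
```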