CUDA Fundamentals

I first picked up these fundamentals through the Accelerated Computing Course. I also read PMPP (Programming Massively Parallel Processors).

Below is a .cu file:

#include <cstdio>

void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}
 
__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}
 
int main()
{
  CPUFunction();
 
  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize(); // IMPORTANT: without this, main() may return before the kernel has run, so the GPU output never appears
}

__global__ void GPUFunction()

The __global__ keyword indicates that the following function will run on the GPU and can be invoked globally, which here means by either the CPU or the GPU.
Code executed on the CPU is referred to as host code, and code running on the GPU is referred to as device code.

Important

Functions declared with the __global__ keyword must have a return type of void.
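Because a kernel cannot return a value, results usually come back through memory that the kernel writes to. Below is a minimal sketch of that idea. The names addOnGPU, d_result and h_result are hypothetical, and cudaMalloc / cudaMemcpy / cudaFree are standard CUDA runtime calls that these notes have not covered yet.

```cuda
#include <cstdio>

// A kernel must return void, so the sum is written through a pointer
// to device memory instead. `addOnGPU` is a hypothetical example kernel.
__global__ void addOnGPU(int a, int b, int *d_result)
{
  *d_result = a + b;
}

int main()
{
  int h_result = 0;          // copy of the result on the host
  int *d_result = nullptr;   // storage for the result on the device

  cudaMalloc(&d_result, sizeof(int));   // allocate one int on the GPU
  addOnGPU<<<1, 1>>>(2, 3, d_result);   // one block, one thread

  // cudaMemcpy blocks the host and runs after the kernel above has finished,
  // so no separate cudaDeviceSynchronize() is needed here.
  cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

  printf("2 + 3 = %d\n", h_result);
  cudaFree(d_result);
  return 0;
}
```

Compile with nvcc like any other .cu file.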

GPUFunction<<<1, 1>>>();

This GPU function is called a kernel.
When launching a kernel, we must provide an execution configuration, which is done with the <<< ... >>> syntax placed just before the kernel's argument list.
At a high level, execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block.

Thread Hierarchy

1st execution configuration argument = number of blocks

2nd execution configuration argument = number of threads per block

So a block is simply a collection of threads.

Important

Unlike much C/C++ code, launching kernels is asynchronous: the CPU code will continue to execute without waiting for the kernel launch to complete.

A call to cudaDeviceSynchronize will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
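A small sketch of this asynchronous behavior, assuming a single GPU and the default stream. The kernel name helloFromGPU is hypothetical. cudaGetLastError and cudaGetErrorString are standard runtime calls that these notes do not cover. They are included only because a failed launch is otherwise silent.

```cuda
#include <cstdio>

__global__ void helloFromGPU()
{
  printf("Hello from the GPU.\n");
}

int main()
{
  // The launch only queues the kernel; control returns to the host right away.
  helloFromGPU<<<1, 1>>>();
  printf("Host continues right after the launch.\n");

  // Errors in the launch itself (e.g. an invalid configuration) show up here.
  cudaError_t launchErr = cudaGetLastError();

  // Block the host until the device has finished all queued work.
  // Errors raised while the kernel was running are returned here.
  cudaError_t syncErr = cudaDeviceSynchronize();

  if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    cudaError_t err = (launchErr != cudaSuccess) ? launchErr : syncErr;
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  printf("Host resumes after the kernel has finished.\n");
  return 0;
}
```

Removing the cudaDeviceSynchronize() call is a quick way to see the problem: the program can exit before the kernel's message is ever printed.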

Each thread is given an index within its thread block, starting at 0.
Each block is given an index, starting at 0.
Blocks are grouped into a grid, which is the highest-level entity in the CUDA thread hierarchy.

CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of 1 or more threads.
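To make the hierarchy concrete, here is a sketch that launches a grid of 2 blocks with 4 threads each and has every thread report its own indices. The kernel name printIndices is hypothetical. blockIdx.x and threadIdx.x are the built-in variables holding the block index and the thread index within the block.

```cuda
#include <cstdio>

// Every thread prints the index of its block and its own index inside that block.
__global__ void printIndices()
{
  printf("block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main()
{
  // 2 blocks, 4 threads per block: 8 threads in total, so 8 lines are printed.
  printIndices<<<2, 4>>>();
  cudaDeviceSynchronize();  // wait so the output appears before main() returns
  return 0;
}
```

The order in which different blocks print is not guaranteed, so the lines for block 1 may appear before those for block 0.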

Why can't blocks have different numbers of threads?

I first assumed it was for optimization reasons, but it is simply not possible: a kernel launch takes a single block size that applies to every block in the grid, which reflects how the hardware is actually implemented.

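So it’s quite simple:

threadIdx.x gives the index of a thread within its block
blockIdx.x gives the index of a block within the grid
blockDim.x gives the number of threads in a block
gridDim.x gives the number of blocks in a grid
The expression threadIdx.x + blockIdx.x * blockDim.x gives each thread a unique index across the whole grid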
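The index expression is typically used to map each thread to one array element. The sketch below doubles every element of an array this way. The names doubleElements, data and n are hypothetical, and cudaMallocManaged (unified memory, accessible from both host and device) is an assumption of this example that the notes above do not introduce.

```cuda
#include <cstdio>

__global__ void doubleElements(int *data, int n)
{
  // Unique index of this thread across the whole grid.
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // The grid may contain more threads than elements, so guard the access.
  if (i < n) {
    data[i] *= 2;
  }
}

int main()
{
  const int n = 1000;
  int *data = nullptr;

  cudaMallocManaged(&data, n * sizeof(int));  // memory visible to CPU and GPU
  for (int i = 0; i < n; ++i) data[i] = i;

  const int threadsPerBlock = 256;
  // Round up so every element is covered: (1000 + 255) / 256 = 4 blocks.
  const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

  doubleElements<<<numBlocks, threadsPerBlock>>>(data, n);
  cudaDeviceSynchronize();  // wait before reading the results on the host

  bool ok = true;
  for (int i = 0; i < n; ++i) {
    if (data[i] != 2 * i) ok = false;
  }
  printf("all elements doubled: %s\n", ok ? "yes" : "no");

  cudaFree(data);
  return 0;
}
```

With 4 blocks of 256 threads there are 1024 threads for 1000 elements, which is why the `i < n` check matters: the last 24 threads must not touch memory past the end of the array.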

Querying GPU Device Properties

Performance gains can often be had by choosing a grid size whose number of blocks is a multiple of the number of SMs (streaming multiprocessors) on a given GPU.

This is why you might want to determine the number of SMs on a GPU.

int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now holds the id of the currently active GPU.
 
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
                                           // the active GPU device.

Refer to https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html for the cudaDeviceProp struct reference.
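As a sketch of how the queried properties might be used, the program below reads the SM count and derives a grid size that is a multiple of it. The multiplier blocksPerSM = 32 is an arbitrary, hypothetical choice for illustration, not a recommended value. The fields name, multiProcessorCount, major, minor and warpSize are members of cudaDeviceProp.

```cuda
#include <cstdio>

int main()
{
  int deviceId;
  cudaGetDevice(&deviceId);                   // id of the currently active GPU

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);  // fill in the device's properties

  printf("Device %d: %s\n", deviceId, props.name);
  printf("SMs: %d, compute capability: %d.%d, warp size: %d\n",
         props.multiProcessorCount, props.major, props.minor, props.warpSize);

  // Pick a number of blocks that is a multiple of the SM count.
  // The factor 32 is an assumption for this example; tune it per kernel.
  const int blocksPerSM = 32;
  const int numBlocks = blocksPerSM * props.multiProcessorCount;

  printf("Grid size to try: %d blocks\n", numBlocks);
  return 0;
}
```

The resulting numBlocks can then be passed as the first execution configuration argument of a kernel launch.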

🛠️ Steven Gong
