CUDA Fundamentals
I first picked up these fundamentals through the Accelerated Computing course, and also read PMPP (Programming Massively Parallel Processors).
Below is a `.cu` file:

```cpp
__global__ void GPUFunction()
```
- The `__global__` keyword indicates that the following function will run on the GPU and can be invoked globally (from either the CPU or the GPU)
- Code executed on the CPU is referred to as host code, and code running on the GPU is referred to as device code
Important
Functions defined with the `__global__` keyword must return type `void`.
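A minimal sketch of what a complete kernel obeying this rule might look like (the body is illustrative; `printf` is supported in device code):

```cpp
#include <cstdio>

// OK: returns void, as every __global__ function must.
__global__ void GPUFunction()
{
    printf("Hello from the GPU!\n");
}

// Would not compile: __global__ functions cannot return a value.
// __global__ int badKernel() { return 42; }
```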
```cpp
GPUFunction<<<1, 1>>>();
```
- This GPU function is called a kernel
- When launching a kernel, we must provide an execution configuration, using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments
- At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks) as well as how many threads to execute in each block
Thread Hierarchy
- 1st execution configuration argument = number of blocks
- 2nd execution configuration argument = number of threads per block

So a block is simply a collection of threads.
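For example, assuming a kernel named `someKernel` that takes no arguments, the following launch uses 2 blocks of 4 threads each, so the kernel body executes 8 times in total:

```cpp
someKernel<<<2, 4>>>();  // 2 blocks, 4 threads per block: 8 threads in total
```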
Important
Unlike much C/C++ code, launching kernels is asynchronous: the CPU code will continue to execute without waiting for the kernel launch to complete.
A call to `cudaDeviceSynchronize` will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
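Putting these pieces together, here is a minimal sketch of a complete `.cu` program (the output string is illustrative):

```cpp
#include <cstdio>

__global__ void GPUFunction()
{
    printf("Hello from the GPU!\n");
}

int main()
{
    GPUFunction<<<1, 1>>>();  // asynchronous: host execution continues immediately
    cudaDeviceSynchronize();  // block the host until the device code completes
    return 0;
}
```

This can be compiled with something like `nvcc -o hello hello.cu` (the filename is illustrative).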
- Each thread is given an index within its thread block, starting at `0`
- Each block is given an index within the grid, starting at `0`
- Blocks are grouped into a grid, which is the highest entity in the CUDA thread hierarchy

CUDA kernels are executed in a grid of 1 or more blocks, with every block containing the same number of threads (1 or more).
Why can't blocks have a different number of threads?
I assumed it was for optimization reasons, or that it was simply not possible. It's the latter: it comes down to the way the hardware is actually implemented.
So it’s quite simple:
- You can use `blockDim.x` to determine the number of threads in a block
- The expression `threadIdx.x + blockIdx.x * blockDim.x` gives a thread's unique index within the entire grid
- `gridDim.x` gives the number of blocks in a grid
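As a small sketch (the kernel name, array, and launch parameters are illustrative), these built-in variables combine in a typical element-wise kernel like this:

```cpp
// Hypothetical kernel: each thread handles at most one element of the array.
__global__ void doubleElements(int *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique index across the grid
    if (i < n)  // guard: the grid may contain more threads than elements
    {
        a[i] *= 2;
    }
}
```

A launch such as `doubleElements<<<(n + 255) / 256, 256>>>(d_a, n);` (where `d_a` is a device pointer) rounds the block count up so the grid covers all `n` elements, which is why the bounds check is needed.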
Querying GPU Device Properties
Performance gains can often be had by choosing a grid whose number of blocks is a multiple of the number of streaming multiprocessors (SMs) on a given GPU.
This is why you might want to determine the number of SMs on a GPU.
Refer to https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html for the `cudaDeviceProp` struct reference.
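A minimal sketch of querying the SM count with `cudaGetDeviceProperties` (the multiplier of 32 is an arbitrary illustration, not a recommendation):

```cpp
#include <cstdio>

int main()
{
    int deviceId;
    cudaGetDevice(&deviceId);                   // ID of the currently active GPU

    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);  // fill the struct for this device

    printf("Device name: %s\n", props.name);
    printf("SM count:    %d\n", props.multiProcessorCount);

    // A block count that is a multiple of the SM count, as suggested above.
    int numberOfBlocks = 32 * props.multiProcessorCount;
    printf("Suggested block count: %d\n", numberOfBlocks);
    return 0;
}
```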