CUDA Fundamentals
First got these fundamentals through the Accelerated Computing Course; also read PMPP (Programming Massively Parallel Processors).
Below is a `.cu` file:
```cpp
#include <cstdio>

void CPUFunction()
{
    printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction()
{
    printf("This function is defined to run on the GPU.\n");
}

int main()
{
    CPUFunction();
    GPUFunction<<<1, 1>>>();
    cudaDeviceSynchronize(); // IMPORTANT: kernel launches are asynchronous, so without
                             // this the program can exit before the GPU ever prints
}
```
`__global__ void GPUFunction()`

- The `__global__` keyword indicates that the function will run on the GPU, and can be invoked globally (from either the CPU or the GPU).
- Code executed on the CPU is referred to as host code, and code running on the GPU is referred to as device code.
Important
Functions with the `__global__` keyword must always have return type `void`.
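Since a kernel can't return a value, results typically come back through device memory that the host then copies out. A minimal sketch of that pattern (the kernel and variable names are my own, not from the course):

```cpp
#include <cstdio>

__global__ void writeAnswer(int *out)
{
    *out = 42; // the kernel "returns" its result by writing to device memory
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));       // allocate memory on the device

    writeAnswer<<<1, 1>>>(d_out);          // the kernel itself returns void

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), // copy the result back to the host;
               cudaMemcpyDeviceToHost);    // this copy waits for the kernel to finish
    printf("answer = %d\n", h_out);

    cudaFree(d_out);
}
```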
`GPUFunction<<<1, 1>>>();`

- This GPU function is called a kernel.
- When launching a kernel, we must provide an execution configuration, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.
- At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block.
Thread Hierarchy

- 1st execution configuration argument = number of blocks
- 2nd execution configuration argument = number of threads per block

So a block is simply a collection of threads.
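To make the two arguments concrete, here's a sketch launching the same kernel with a few different configurations (my own example):

```cpp
#include <cstdio>

__global__ void hello()
{
    printf("Hello from the GPU!\n");
}

int main()
{
    hello<<<1, 1>>>(); // 1 block  of 1 thread  -> prints once
    hello<<<1, 4>>>(); // 1 block  of 4 threads -> prints 4 times
    hello<<<2, 4>>>(); // 2 blocks of 4 threads -> prints 8 times
    cudaDeviceSynchronize(); // wait for the GPU (see the note below)
}
```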
Important
Unlike much C/C++ code, launching kernels is asynchronous: the CPU code will continue to execute without waiting for the kernel launch to complete. A call to `cudaDeviceSynchronize()` will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
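`cudaDeviceSynchronize()` also returns a `cudaError_t`, so it's a convenient place to surface errors that occurred while the asynchronously launched kernel was running. A sketch, assuming some hypothetical kernel `someKernel`:

```cpp
someKernel<<<1, 1>>>(); // hypothetical kernel, launched asynchronously

cudaError_t err = cudaDeviceSynchronize(); // block until the device is done
if (err != cudaSuccess)
{
    printf("CUDA error: %s\n", cudaGetErrorString(err)); // human-readable message
}
```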
- Each thread is given an index within its thread block, starting at `0`.
- Each block is given an index, starting at `0`.
- Blocks are grouped into a grid, which is the highest entity in the CUDA thread hierarchy.
CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of threads (1 or more).

Why can't blocks have different numbers of threads?
I assumed it was for optimization reasons, or that it was simply not possible. It's the latter: because of the way the hardware is actually implemented, every block in a launch must have the same dimensions.
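A quick way to see the hierarchy (and that every block has the same shape) is to print the built-in index variables; a small sketch:

```cpp
#include <cstdio>

__global__ void whoAmI()
{
    // threadIdx.x: this thread's index within its block
    // blockIdx.x:  this block's index within the grid
    printf("block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main()
{
    whoAmI<<<3, 2>>>(); // 3 blocks, each with exactly 2 threads
    cudaDeviceSynchronize();
}
```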
So it’s quite simple:

- `blockDim.x` gives the number of threads in a block.
- You can use the expression `threadIdx.x + blockIdx.x * blockDim.x` to compute each thread's globally unique index within the grid.
- `gridDim.x` gives the number of blocks in a grid.
Querying GPU Device Properties

Performance gains can often be had by choosing a grid size whose number of blocks is a multiple of the number of SMs (streaming multiprocessors) on a given GPU. This is why you might want to determine the number of SMs on a GPU programmatically.
```cpp
int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now holds the id of the currently active GPU

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now holds many useful properties of the active GPU device
```
Refer to https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html for the `cudaDeviceProp` struct reference.
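Tying this back to the SM advice above, the queried properties can feed straight into the execution configuration. A sketch (the empty kernel and the factor of 32 are placeholder choices of mine):

```cpp
#include <cstdio>

__global__ void busyWork() { /* placeholder kernel body */ }

int main()
{
    int deviceId;
    cudaGetDevice(&deviceId);

    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);

    int numberOfSMs = props.multiProcessorCount; // SM count for this device
    int numberOfBlocks = 32 * numberOfSMs;       // grid size as a multiple of the SM count

    busyWork<<<numberOfBlocks, 256>>>();
    cudaDeviceSynchronize();
    printf("launched %d blocks across %d SMs\n", numberOfBlocks, numberOfSMs);
}
```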