CUDA Fundamentals
I first learned these fundamentals through the Accelerated Computing course, and also by reading PMPP (Programming Massively Parallel Processors).
Below is a minimal .cu file:
#include <stdio.h>

void CPUFunction() {
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction() {
  printf("This function is defined to run on the GPU.\n");
}

int main() {
  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize(); // IMPORTANT: the launch is asynchronous, so without this wait the program can exit before the GPU output is printed
}
__global__ void GPUFunction()
- The __global__ keyword indicates that the following function will run on the GPU and can be invoked globally, which in this context means by either the CPU or the GPU.
- Code executed on the CPU is referred to as host code, and code running on the GPU is referred to as device code.
- Functions defined with the __global__ keyword must always have return type void.
GPUFunction<<<1, 1>>>();
- This GPU function is called a kernel.
- When launching a kernel, we must provide an execution configuration, which is done by using the <<< ... >>> syntax just prior to passing the kernel any expected arguments.
- At a high level, the execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block.
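To make the two execution-configuration values concrete, here is a minimal sketch that launches the same kernel with a few configurations and counts how many threads actually ran it. The names countLaunch and launchAndCount are hypothetical, made up for this illustration, and managed memory (cudaMallocManaged) is used only to keep the example short.

```cuda
#include <cstdio>

// Hypothetical kernel for illustration: every thread adds 1 to a shared counter.
__global__ void countLaunch(int *counter)
{
  atomicAdd(counter, 1);
}

// Hypothetical helper: launches `blocks` blocks of `threadsPerBlock` threads
// and returns how many threads executed the kernel.
int launchAndCount(int blocks, int threadsPerBlock)
{
  int *counter = nullptr;
  cudaMallocManaged(&counter, sizeof(int)); // memory visible to both host and device
  *counter = 0;

  countLaunch<<<blocks, threadsPerBlock>>>(counter);
  cudaDeviceSynchronize(); // wait for the kernel before reading the counter

  int total = *counter;
  cudaFree(counter);
  return total;
}

int main()
{
  printf("<<<1, 1>>> ran %d thread(s)\n", launchAndCount(1, 1));
  printf("<<<1, 8>>> ran %d thread(s)\n", launchAndCount(1, 8));
  printf("<<<4, 8>>> ran %d thread(s)\n", launchAndCount(4, 8));
  return 0;
}
```

On a working GPU, each count should equal the first value multiplied by the second, since the kernel body runs once per thread.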
Thread Hierarchy
- 1st execution configuration argument = number of blocks in the grid
- 2nd execution configuration argument = number of threads in each block
So a block is simply a collection of threads.
Unlike much C/C++ code, launching kernels is asynchronous: the CPU code will continue to execute without waiting for the kernel launch to complete.
A call to cudaDeviceSynchronize will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
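Since the launch returns right away, problems can show up at two points: when the launch is issued and when the host waits for the kernel to finish. The following is a minimal sketch of checking both, assuming a hypothetical kernel setFlag and a hypothetical helper runAndWait; it is one reasonable pattern, not the only way to structure error handling.

```cuda
#include <cstdio>

// Hypothetical kernel: writes a value so the host can observe that it finished.
__global__ void setFlag(int *flag)
{
  *flag = 1;
}

// Hypothetical helper: launches the kernel, waits for it, and returns the flag
// value, or -1 if CUDA reported an error along the way.
int runAndWait()
{
  int *flag = nullptr;
  if (cudaMallocManaged(&flag, sizeof(int)) != cudaSuccess) return -1;
  *flag = 0;

  setFlag<<<1, 1>>>(flag); // returns immediately; the kernel may not have run yet

  cudaError_t launchErr = cudaGetLastError();    // errors from the launch itself
  cudaError_t syncErr = cudaDeviceSynchronize(); // blocks until the kernel finishes
  if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    cudaError_t err = (launchErr != cudaSuccess) ? launchErr : syncErr;
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(flag);
    return -1;
  }

  int result = *flag; // only read the result after synchronizing
  cudaFree(flag);
  return result;
}

int main()
{
  printf("flag after synchronization: %d\n", runAndWait());
  return 0;
}
```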
- Each thread is given an index within its thread block, starting at 0
- Each block is given an index, starting at 0
- Blocks are grouped into a grid, which is the highest entity in the CUDA thread hierarchy.
CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of 1 or more threads.
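Why can't blocks have different numbers of threads?
My first guess was optimization reasons, but it is simply not possible: the execution configuration takes a single threads-per-block value that applies to every block in the grid, which follows from the way the hardware is actually implemented.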
So it’s quite simple:
- Use blockDim.x to determine the number of threads in a block.
- You can use the expression threadIdx.x + blockIdx.x * blockDim.x to get a thread's unique index within the whole grid.
- gridDim.x gives the number of blocks in a grid.
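As a rough illustration of the index expression, the sketch below has each thread compute threadIdx.x + blockIdx.x * blockDim.x and store that value at the matching array position. The names writeGlobalIndex and checkGlobalIndex are hypothetical, and the rounded-up block count is just one common way to cover n elements.

```cuda
#include <cstdio>

// Hypothetical kernel: each thread stores its grid-wide index at that position.
__global__ void writeGlobalIndex(int *out, int n)
{
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) // guard: the grid may contain more threads than elements
    out[idx] = idx;
}

// Hypothetical helper: returns true if every element i ends up holding i.
bool checkGlobalIndex(int n, int threadsPerBlock)
{
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // round up
  int *out = nullptr;
  cudaMallocManaged(&out, n * sizeof(int));
  for (int i = 0; i < n; ++i) out[i] = -1; // sentinel: "not written yet"

  writeGlobalIndex<<<blocks, threadsPerBlock>>>(out, n);
  cudaDeviceSynchronize();

  bool ok = true;
  for (int i = 0; i < n; ++i)
    if (out[i] != i) ok = false;

  cudaFree(out);
  return ok;
}

int main()
{
  // 10 elements with 4 threads per block: 3 blocks, 12 threads, 2 of them idle.
  printf("indices %s\n", checkGlobalIndex(10, 4) ? "match" : "do not match");
  return 0;
}
```

The bounds check matters whenever n is not a multiple of the block size, because every block still has the same number of threads.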
Querying GPU Device Properties
Performance gains can often be had by choosing a grid size whose number of blocks is a multiple of the number of streaming multiprocessors (SMs) on a given GPU.
This is why you might want to determine the number of SMs on a GPU.
int deviceId;
cudaGetDevice(&deviceId); // `deviceId` now holds the id of the currently active GPU.
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
// the active GPU device.
Refer to the CUDA Runtime API documentation for the cudaDeviceProp struct reference.
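Putting the query to use, here is a minimal sketch that reads the SM count and derives a block count that is a multiple of it. The helper gridSizeForDevice and its blocksPerSm parameter are hypothetical, and the value 32 is only an illustrative starting point, not a tuned recommendation.

```cuda
#include <cstdio>

// Hypothetical helper: returns a block count that is a multiple of the SM count.
// `blocksPerSm` is an illustrative knob to experiment with.
int gridSizeForDevice(int blocksPerSm)
{
  int deviceId = 0;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  return props.multiProcessorCount * blocksPerSm;
}

int main()
{
  int deviceId = 0;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  printf("Device %d: %s\n", deviceId, props.name);
  printf("Compute capability: %d.%d\n", props.major, props.minor);
  printf("SM count: %d, warp size: %d\n", props.multiProcessorCount, props.warpSize);
  printf("Grid size at 32 blocks per SM: %d\n", gridSizeForDevice(32));
  return 0;
}
```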