CUDA Fundamentals

First got these fundamentals through the Accelerated Computing Course. Also read PMPP.

Below is a .cu file:

#include <stdio.h>

void CPUFunction() {
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction() {
  printf("This function is defined to run on the GPU.\n");
}

int main() {
  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize(); // IMPORTANT: the launch is asynchronous, so without this main may exit before the GPU prints
}

__global__ void GPUFunction()

  • The __global__ keyword indicates that the following function will run on the GPU and can be invoked globally, meaning from either the CPU or the GPU
  • Code executed on the CPU is referred to as host code, and code running on the GPU is referred to as device code


Functions declared with the __global__ keyword must have return type void.

GPUFunction<<<1, 1>>>();

  • This GPU function is called a kernel

  • When launching a kernel, we must provide an execution configuration, which is done by using the <<< ... >>> syntax just prior to passing the kernel any expected arguments

  • At a high level, execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread groupings (called blocks), as well as how many threads to execute in each block.

Thread Hierarchy

  • 1st execution configuration argument = number of blocks in the grid
  • 2nd execution configuration argument = number of threads in each block

So a block is simply a collection of threads.
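
As a rough illustration of the two configuration arguments, the sketch below launches 2 blocks of 4 threads each and has every thread add 1 to a counter, so the host can confirm that 2 * 4 = 8 threads ran. The names countThreads and launchAndCount are hypothetical (not from the course), and using cudaMallocManaged and atomicAdd here is just one assumed way to get a result back to the CPU.

```cuda
#include <stdio.h>

// Hypothetical kernel: every thread that runs adds 1 to *counter.
__global__ void countThreads(int *counter)
{
  atomicAdd(counter, 1);
}

// Hypothetical helper: launches the kernel with the given execution
// configuration and returns how many threads actually ran.
int launchAndCount(int numBlocks, int threadsPerBlock)
{
  int *counter;
  cudaMallocManaged(&counter, sizeof(int)); // memory visible to host and device
  *counter = 0;

  countThreads<<<numBlocks, threadsPerBlock>>>(counter);
  cudaDeviceSynchronize(); // wait for the kernel before reading the counter

  int total = *counter;
  cudaFree(counter);
  return total;
}

int main()
{
  printf("threads run: %d\n", launchAndCount(2, 4));
  return 0;
}
```

Changing the call to launchAndCount(4, 2) should give the same total with a different grouping: 4 blocks of 2 threads instead of 2 blocks of 4.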


Unlike much C/C++ code, launching kernels is asynchronous: the CPU code will continue to execute without waiting for the kernel launch to complete.

A call to cudaDeviceSynchronize will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
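
A minimal sketch of that ordering, assuming a hypothetical kernel writeValue and helper launchThenWait: the host keeps running right after the launch, and only reads the kernel's result once cudaDeviceSynchronize has returned. The managed allocation is again an assumption made to keep the example short.

```cuda
#include <stdio.h>

// Hypothetical kernel: a single thread writes a value the host will read later.
__global__ void writeValue(int *out)
{
  *out = 42;
}

// Hypothetical helper: launches the kernel, does unrelated host work,
// then waits for the GPU before reading the result.
int launchThenWait()
{
  int *out;
  cudaMallocManaged(&out, sizeof(int));
  *out = 0;

  writeValue<<<1, 1>>>(out); // returns to the host immediately

  // The host is free to do independent work here, but should not
  // touch *out until the kernel is known to have finished.
  printf("Host continues while the kernel may still be running.\n");

  cudaDeviceSynchronize(); // block until the device has completed
  int result = *out;       // safe to read now

  cudaFree(out);
  return result;
}

int main()
{
  printf("value written by the GPU: %d\n", launchThenWait());
  return 0;
}
```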

  • Each thread is given an index within its thread block, starting at 0
  • Each block is given an index, starting at 0
  • Blocks are grouped into a grid, which is the highest entity in the CUDA thread hierarchy.

CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of 1 or more threads.

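Why can't blocks have different numbers of threads?

My first guess was optimization reasons, but it is simply not possible: the execution configuration takes a single block size that applies to every block in the launch, which follows from the way the hardware is actually implemented.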

So it’s quite simple:

  • blockDim.x gives the number of threads in a block
  • gridDim.x gives the number of blocks in the grid
  • The expression threadIdx.x + blockIdx.x * blockDim.x gives each thread a unique index across the whole grid
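
To make the index expression concrete, here is a small sketch in which each thread stores its own grid-wide index into an array, and the host then checks that every slot holds its own position. The names writeGlobalIndex and countCorrectIndices are hypothetical, and the bounds check plus the round-up when computing the block count are assumptions added so the example works when the array length is not a multiple of the block size.

```cuda
#include <stdio.h>

// Hypothetical kernel: each thread stores its grid-wide index at that position.
__global__ void writeGlobalIndex(int *out, int n)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n) // the grid may contain more threads than there are elements
    out[i] = i;
}

// Hypothetical helper: returns how many elements ended up holding their own index.
int countCorrectIndices(int n, int threadsPerBlock)
{
  int *out;
  cudaMallocManaged(&out, n * sizeof(int));
  for (int k = 0; k < n; ++k)
    out[k] = -1;

  // Round up so that numBlocks * threadsPerBlock >= n.
  int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  writeGlobalIndex<<<numBlocks, threadsPerBlock>>>(out, n);
  cudaDeviceSynchronize();

  int correct = 0;
  for (int k = 0; k < n; ++k)
    if (out[k] == k)
      ++correct;

  cudaFree(out);
  return correct;
}

int main()
{
  printf("%d of 1000 elements hold their own index\n",
         countCorrectIndices(1000, 256));
  return 0;
}
```

With 1000 elements and 256 threads per block, the launch uses 4 blocks (1024 threads), so the last 24 threads fail the bounds check and do nothing.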

Querying GPU Device Properties

Performance gains can often be had by choosing a grid size whose number of blocks is a multiple of the number of streaming multiprocessors (SMs) on a given GPU.

This is why you might want to determine the number of SMs on a GPU.

int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now holds the id of the currently active GPU.
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
                                           // the active GPU device.

Refer to the CUDA Runtime API documentation for the cudaDeviceProp struct reference.
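
One way the SM count might be used, as a sketch: read the multiProcessorCount field of cudaDeviceProp and round a desired block count up to the next multiple of it. The helper roundUpToMultiple and the starting figure of 100 blocks are hypothetical, chosen only for illustration; whether this sizing actually helps depends on the kernel and should be measured.

```cuda
#include <stdio.h>

// Hypothetical helper: rounds numBlocks up to the next multiple of smCount.
int roundUpToMultiple(int numBlocks, int smCount)
{
  return ((numBlocks + smCount - 1) / smCount) * smCount;
}

int main()
{
  int deviceId;
  cudaGetDevice(&deviceId); // id of the currently active GPU

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  int smCount = props.multiProcessorCount; // number of SMs on this device
  int desiredBlocks = 100;                 // assumed starting point
  int numBlocks = roundUpToMultiple(desiredBlocks, smCount);

  printf("SMs: %d, blocks requested: %d, blocks after rounding: %d\n",
         smCount, desiredBlocks, numBlocks);
  return 0;
}
```

Rounding up launches extra blocks, so a kernel launched with the result still needs a bounds check on its computed index.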