CUDA Kernel
A CUDA kernel is the function that every CUDA thread runs, largely in parallel, on the GPU. Introduced in ECE459 L21. It compiles to PTX (Parallel Thread eXecution) via nvcc -ptx foo.cu. You can hand-write PTX, but normally you don't. Recompile the PTX when moving to a different machine.
Why write separate kernel code?
Host code runs on the CPU and orchestrates memory transfers and kernel launches. The kernel runs on the GPU's many cores. The two have different instruction sets, memory spaces, and safety models, so they are compiled separately.
Writing a kernel
Vector add, where the loop disappears into the index space:
```cuda
__global__ void vector_add(float* A, float* B, float* C) {
    int i = blockIdx.x;
    C[i] = A[i] + B[i];
}
```
threadIdx / blockIdx are three-component vectors (.x, .y, .z). Use only the components matching your problem's dimensionality.
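The index-space idea can be pictured host-side. This is a plain Rust sketch, not CUDA (the helper name is made up): it shows how enumerating (block, thread) pairs with the idx = blockIdx.x * blockDim.x + threadIdx.x formula replaces the serial loop.

```rust
// Host-side Rust sketch (not CUDA) of how the (blockIdx, threadIdx)
// index space replaces the serial loop of vector_add.
fn vector_add_host(a: &[f32], b: &[f32], block_dim: usize) -> Vec<f32> {
    let n = a.len();
    let grid_dim = (n + block_dim - 1) / block_dim; // round up: number of blocks
    let mut c = vec![0.0f32; n];
    // Each (block, thread) pair is one logical CUDA thread.
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let i = block_idx * block_dim + thread_idx; // global index
            if i < n {
                // bounds guard: the last block may be partially full
                c[i] = a[i] + b[i];
            }
        }
    }
    c
}

fn main() {
    let a: Vec<f32> = (0..6).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..6).map(|i| (10 * i) as f32).collect();
    // block_dim = 4 gives two blocks of four logical threads for six elements.
    assert_eq!(
        vector_add_host(&a, &b, 4),
        vec![0.0, 11.0, 22.0, 33.0, 44.0, 55.0]
    );
    println!("ok");
}
```

On the GPU the two loops do not exist: every (block, thread) pair runs concurrently, which is why the bounds guard belongs inside the kernel.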
Function qualifiers:
__global__: entry point, called from the host; can call __device__ functions but not other __global__ functions
__device__: callable from GPU code only
extern "C": disables C++ name mangling so the host code can find the symbol
No Rust safety guarantees
You can read past the end of arrays, skip initialization, leak memory, and race. Successful compilation says nothing about correctness, so always check output against expectations. Example failure seen in class:
thread 'main' panicked at 'Failed to deallocate CUDA Device memory.: IllegalAddress'
Vector types
float4, float3, int2, etc. group one to four primitives, accessed as .x, .y, .z, .w. They're convenient packaging without declaring a custom struct; custom structs are also allowed.
N-body kernel (L21)
```cuda
__device__ void body_body_interaction(float4 p1, float4 p2, float3 *acc) { ... }

extern "C" __global__ void calculate_forces(const float4* positions,
                                            float3* accelerations, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    float4 current = positions[idx];
    float3 acc = accelerations[idx];
    for (int i = 0; i < n; i++)
        body_body_interaction(current, positions[i], &acc);
    accelerations[idx] = acc;
}
```
Branch divergence
SIMT has no branch prediction, no speculative execution. Hardware executes every branch any thread in the warp took, then keeps only the valid results. Conditionals get expensive fast.
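One way to picture this is predication: every lane computes both sides, and a mask keeps only the result for the side that lane actually took. A host-side Rust sketch (an illustration of the cost model, not how the silicon is implemented):

```rust
// Predication sketch: both branch bodies are evaluated for every lane;
// the lane's condition ("mask") selects which result survives.
fn predicated_update(a: f32, b: f32, cond: bool) -> (f32, f32) {
    let then_result = a + 5.0; // "then" side: every lane pays for this
    let else_result = b + 5.0; // "else" side: every lane pays for this too
    // The mask throws away the result from the side this lane didn't take.
    if cond { (then_result, b) } else { (a, else_result) }
}

fn main() {
    // A 4-lane "warp": each lane does both additions but keeps one result.
    let lanes: Vec<(f32, f32)> =
        (0..4).map(|lane| predicated_update(1.0, 2.0, lane % 2 == 0)).collect();
    assert_eq!(lanes[0], (6.0, 2.0)); // even lane kept the "then" side
    assert_eq!(lanes[1], (1.0, 7.0)); // odd lane kept the "else" side
    println!("ok");
}
```

The work done is the sum of both branch bodies regardless of which side each lane takes, which is why divergent conditionals get expensive.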
```cuda
if (condition()) a[blockIdx.x] += 5.0;  // both sides run,
else             b[blockIdx.x] += 5.0;  // wrong results thrown away
```
Loops: the workgroup waits for the maximum iteration count across work-items. Loop counts that are compile-time constants can be unrolled:
```cuda
for (int i = 0; i < 12; ++i) p1[i] += p2[i] * 2;
```
Atomics
Race conditions are still possible. Atomics exist for int/float etc.: add, sub, min, inc, dec, CAS, and bitwise operations. Example: an atomicAdd for doubles, built on atomicCAS (comparing the bits as integers avoids an infinite loop on NaN, since NaN != NaN):
```cuda
__device__ double atomicAdd(double* address, double val) {
    unsigned long long* a = (unsigned long long*)address;
    unsigned long long old = *a, assumed;
    do {
        assumed = old;
        old = atomicCAS(a, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```
Choosing dimensionality
Make the outer loop the grid. For N-body, the per-pair work is too tiny to be worth a work-item per pair, so 1D (per body) beats 2D (per pair). Different problems tip the other way. You can also flatten dimensions, since C arrays are linear under the hood.
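Flattening is just index arithmetic. A small Rust sketch (the helper names are made up) of the row-major mapping a linear C array uses:

```rust
// Row-major flattening: a (row, col) pair in a grid of the given width
// maps to a single linear index, and back.
fn flatten(row: usize, col: usize, width: usize) -> usize {
    row * width + col
}

fn unflatten(idx: usize, width: usize) -> (usize, usize) {
    (idx / width, idx % width)
}

fn main() {
    let width = 5;
    // In a 3x5 grid stored linearly, (2, 3) lives at slot 13.
    assert_eq!(flatten(2, 3, width), 13);
    assert_eq!(unflatten(13, width), (2, 3));
    println!("ok");
}
```

The same trick lets a 1D grid of work-items cover a conceptually 2D problem: compute a linear idx, then unflatten it inside the kernel.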
Host launch (Rustacuda, L22)
rustacuda::init(CudaFlags::empty()): the first API call auto-initializes the runtime, so early errors can actually be setup bugs
Device::get_device(0): select GPU 0
Context::create_and_push(...): analogous to a process with its own address space. Each host thread keeps a stack of current contexts, and the context must stay in scope
Module::load_from_string(ptx): PTX bytes loaded via include_str!
Stream::new(StreamFlags::DEFAULT, None): commands on the same stream are ordered; different streams are unordered. Use DEFAULT so kernel launches run after the memory copies issued by DeviceBuffer::from_slice
DeviceBuffer::from_slice(...): explicit host-to-device transfer
unsafe { launch!(module.fn<<<grid, block, shm, stream>>>(args...)) }: unsafe wraps only the launch
stream.synchronize(): kernel launches are asynchronous, so wait before reading results
buffer.copy_to(&mut host): device-to-host transfer
DeviceCopy
Rust types going to the GPU must implement DeviceCopy, a promise that the type contains no host-memory pointers. cuda-sys::vector_types gives float3/float4; wrap them in a newtype plus unsafe impl DeviceCopy and Deref.
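A minimal sketch of that newtype pattern, kept crate-free so it compiles on its own; the struct names are made up, and with rustacuda in scope you would add the unsafe impl DeviceCopy line (shown here as a comment):

```rust
use std::ops::Deref;

// Stand-in for cuda-sys's float4: #[repr(C)] so the layout matches the GPU side.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Float4 { x: f32, y: f32, z: f32, w: f32 }

// Newtype wrapper; contains no host-memory pointers, so the DeviceCopy
// promise would be sound.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct DevFloat4(Float4);

// unsafe impl rustacuda::memory::DeviceCopy for DevFloat4 {} // with rustacuda

impl Deref for DevFloat4 {
    type Target = Float4;
    fn deref(&self) -> &Float4 { &self.0 }
}

fn main() {
    let p = DevFloat4(Float4 { x: 1.0, y: 2.0, z: 3.0, w: 0.0 });
    // Deref lets host code use the inner fields directly.
    assert_eq!(p.x, 1.0);
    assert_eq!(p.w, 0.0);
    println!("ok");
}
```

The unsafe impl is a hand-written promise to the compiler, which is exactly why it must be unsafe: nothing checks that the type really is pointer-free.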