CUDA Kernel

A CUDA kernel is the function that every CUDA thread runs, mostly in parallel, on the GPU. Introduced in ECE459 L21. It compiles to PTX (Parallel Thread eXecution) assembly via nvcc -ptx foo.cu. You can hand-write PTX, but you normally don't. Recompile when moving to a different machine.

Why write separate kernel code?

Host code runs on the CPU and orchestrates memory transfers and kernel launches. The kernel runs on the GPU's many cores. The two have different instruction sets, memory spaces, and safety models, so they are compiled separately.

Writing a kernel

Vector add, where the loop disappears into the index space:

__global__ void vector_add(float* A, float* B, float* C) {
    int i = blockIdx.x;     // one element per block in this simple launch
    C[i] = A[i] + B[i];     // each thread handles a single index
}

threadIdx / blockIdx are three-component vectors (.x, .y, .z). Use only the components matching your problem dimensionality.
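These components feed the usual global-index computation. A host-side sketch in Rust (Dim3 and the function name are illustrative stand-ins; on the device you read the threadIdx/blockIdx/blockDim built-ins directly):

```rust
#[allow(dead_code)]
struct Dim3 { x: u32, y: u32, z: u32 }

// Global x index of a thread within the whole grid: which block it is in,
// times the block width, plus its position inside the block.
fn global_index_x(block_idx: &Dim3, block_dim: &Dim3, thread_idx: &Dim3) -> u32 {
    block_idx.x * block_dim.x + thread_idx.x
}
```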

Function qualifiers:

  • __global__: an entry point, called from the host; can call __device__ functions but not other __global__ functions
  • __device__: callable only from GPU code
  • extern "C": disables C++ name mangling so the host code can find the symbol

No Rust safety guarantees

You can read past the end of arrays, skip initialization, leak memory, and race. Successful compilation says nothing about correctness, so always check the output against expectations. Example failure seen in class:

thread 'main' panicked at 'Failed to deallocate CUDA Device memory.: IllegalAddress'

Vector types

float4, float3, int2, etc. are bundles of 1 to 4 primitives, accessed as .x, .y, .z, .w. They're convenient packaging without defining a custom struct. Structs are also allowed.

N-body kernel (L21)

__device__ void body_body_interaction(float4 p1, float4 p2, float3 *acc) { ... }
 
extern "C" __global__ void calculate_forces(const float4* positions,
                                            float3* accelerations, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    float4 current = positions[idx];
    float3 acc     = accelerations[idx];
    for (int i = 0; i < n; i++)
        body_body_interaction(current, positions[i], &acc);
    accelerations[idx] = acc;
}
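The `if (idx >= n) return;` guard exists because the grid is sized by ceiling division, so the last block may contain extra threads. A sketch of that launch arithmetic (function name is illustrative):

```rust
// Ceiling division: the smallest block count such that
// blocks * threads_per_block >= n. Leftover threads in the last
// block are filtered out by the kernel's bounds check.
fn blocks_for(n: u32, threads_per_block: u32) -> u32 {
    (n + threads_per_block - 1) / threads_per_block
}
```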

Branch divergence

SIMT hardware has no branch prediction and no speculative execution. The warp executes every branch path that any of its threads took, then masks out the invalid results. Conditionals get expensive fast.

if (condition()) a[blockIdx.x] += 5.0;  // both sides run,
else             b[blockIdx.x] += 5.0;  // wrong results thrown away

Loops: the warp waits for the maximum iteration count across its threads. Loop counts that are compile-time constants can be unrolled:

for (int i = 0; i < 12; ++i) p1[i] += p2[i]*2;
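The waiting rule above can be stated as a toy cost model (an illustration, not a CUDA API): a warp's loop runs until its slowest thread finishes, so the warp pays for the maximum trip count.

```rust
// Toy SIMT cost model: each element is one thread's loop trip count;
// the whole warp is occupied for the maximum of them.
fn warp_loop_iterations(trip_counts: &[u32]) -> u32 {
    trip_counts.iter().copied().max().unwrap_or(0)
}
```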

Atomics

Race conditions are still possible. Atomics exist for int/float etc.: add, sub, min, inc, dec, CAS, and bitwise ops. Example: an atomicAdd for doubles built on atomicCAS. Comparing the bits as integers avoids the hang a floating-point compare would cause on NaN (NaN != NaN, so the loop would never exit):

__device__ double atomicAdd(double* address, double val) {
    unsigned long long* a = (unsigned long long*)address;
    unsigned long long old = *a, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and swap in the result only if the
        // value has not changed since we read it.
        old = atomicCAS(a, assumed,
              __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);            // another thread won the race: retry
    return __longlong_as_double(old);    // value before the add
}
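The same CAS loop can be sketched host-side in Rust with std::sync::atomic (an analogue for illustration, not the CUDA API), again comparing the raw 64-bit pattern rather than the float:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// CAS loop on the bit pattern of an f64: read, compute the new bits,
// and retry whenever another thread changed the value in between.
fn atomic_add_f64(cell: &AtomicU64, val: f64) -> f64 {
    let mut old = cell.load(Ordering::Relaxed);
    loop {
        let new = (f64::from_bits(old) + val).to_bits();
        match cell.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return f64::from_bits(old), // value before the add, like atomicAdd
            Err(current) => old = current,       // lost the race: retry with fresh value
        }
    }
}
```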

Choosing dimensionality

Make the outer loop the grid. For N-body, the per-pair work is too small to justify a work-item per pair, so a 1D decomposition (one work-item per body) beats 2D (one per pair). Different problems tip the other way. You can also flatten dimensions yourself, since C arrays are linear under the hood.
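Flattening is just the row-major offset computation (a sketch; names are illustrative):

```rust
// Row-major flattening: a (row, col) pair in a width-wide 2D problem
// maps to a single linear offset, so a 1D grid can cover a 2D space.
fn flatten(row: usize, col: usize, width: usize) -> usize {
    row * width + col
}
```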

Host launch (Rustacuda, L22)

  1. rustacuda::init(CudaFlags::empty()): the first call auto-initializes the runtime, so early errors can actually be setup bugs
  2. Device::get_device(0)
  3. Context::create_and_push(...): analogous to a process with its own address space. Each host thread keeps a stack of current contexts, and the context must stay in scope
  4. Module::load_from_string(ptx): PTX bytes loaded via include_str!
  5. Stream::new(StreamFlags::DEFAULT, None): commands on the same stream are ordered, different streams are unordered. Use DEFAULT so kernel launches run after the memory copies issued by DeviceBuffer::from_slice
  6. DeviceBuffer::from_slice(...): explicit H→D transfer
  7. unsafe { launch!(module.fn<<<grid, block, shm, stream>>>(args...)) }: unsafe only around the launch
  8. stream.synchronize(): kernel launches are async
  9. buffer.copy_to(&mut host): D→H

DeviceCopy

Rust types sent to the GPU must implement DeviceCopy, a promise that the type contains no pointers into host memory. cuda-sys::vector_types provides float3 / float4; wrap them in a newtype plus unsafe impl DeviceCopy and a Deref impl.
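A sketch of that newtype pattern (Float4 here is a stand-in for cuda_sys::vector_types::float4, and the DeviceCopy impl is left as a comment because the trait lives in the rustacuda crate):

```rust
use std::ops::Deref;

#[repr(C)]                 // match the C layout the kernel expects
#[derive(Clone, Copy)]
struct Float4 { x: f32, y: f32, z: f32, w: f32 }

struct MyFloat4(Float4);   // newtype so we can implement foreign traits

// unsafe impl DeviceCopy for MyFloat4 {}   // promise: no host pointers inside

impl Deref for MyFloat4 {
    type Target = Float4;
    fn deref(&self) -> &Float4 { &self.0 }  // .x/.y/.z/.w pass through
}
```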