GPU Programming

GPU programming runs a single function (a kernel) across many data points in parallel. NVIDIA calls this model SIMT (Single Instruction, Multiple Thread). This is the embarrassingly-parallel regime. Introduced in ECE459 L21/L22.
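A minimal sketch of the model (a hypothetical element-wise add, not code from the lectures): one function, and every thread runs it on the single element its ID points at.

```c
// One function, many threads: each thread derives a unique global index
// from its thread and block IDs and handles exactly one element.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global thread ID
    if (i < n)                // the grid may overhang the data; guard it
        c[i] = a[i] + b[i];
}
```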

Why use a GPU?

ecetesla2 has a 4-core 3.6 GHz CPU versus 1920 CUDA cores at 1.8 GHz. Half the per-core speed, 480× the workers, if the workload is suitable.

Drive vs fly analogy

Short trip (Ottawa to Montreal, 200 km): flying loses to driving because of airport overhead. Long trip (Waterloo to SF, 4000 km): flying wins. The GPU pays setup plus host-to-device transfer cost, so it’s only worth it with enough parallel work.
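A host-side sketch of that overhead, assuming the vector_add kernel above (function and variable names are illustrative): every launch pays for device allocation and two transfers across the bus before any speedup can show up.

```c
#include <cuda_runtime.h>

// The fixed "airport overhead" of a GPU launch: allocate on the device,
// copy inputs host-to-device, run the kernel, copy results back.
void add_on_gpu(const float *a, const float *b, float *c, int n) {
    size_t bytes = n * sizeof(float);
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);   // host-to-device
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);   // device-to-host
    cudaFree(da); cudaFree(db); cudaFree(dc);
}
```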

Index space

The kernel runs over an ND-Range, a 1D/2D/3D grid of work-items (each is a thread with a unique ID). Threads group into blocks; threads within a block share local memory, and the blocks themselves must be able to run in any order.

Beyond three dimensions you fall back to explicit loops. 2D and 3D often flatten cleanly, since a C [x][y] array is linear in memory anyway (see the sketch below). For CUDA syntax and block-configuration code, see CUDA.
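A sketch of a 2D ND-Range under assumed names (brighten, a row-major width×height image): each thread computes its 2D coordinates, then flattens them into the linear index the underlying array actually uses.

```c
// Each thread finds its (x, y) position in the 2D grid, then indexes the
// image as [y][x] flattened into row-major linear memory.
__global__ void brighten(float *img, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;   // column
    int y = threadIdx.y + blockIdx.y * blockDim.y;   // row
    if (x < width && y < height)
        img[y * width + x] += 0.1f;
}

// Host side: a 2D grid of 16x16 blocks covering the whole image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height);
```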

Memory

Vendor-neutral naming (OpenCL): private / local / global / constant / texture. CUDA’s equivalents plus sub-notes live in CUDA Memory.
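An illustrative kernel (names assumed, not from the notes) mapping those OpenCL names onto CUDA's qualifiers; texture memory, which is accessed through texture objects, is omitted here.

```c
__constant__ float coeffs[4];   // constant: read-only, cached, set from the
                                // host with cudaMemcpyToSymbol

__global__ void demo(const float *in, float *out, int n) {  // in/out: global
    __shared__ float tile[256]; // local (CUDA calls it shared): per-block;
                                // assumes 256-thread blocks
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float v = (i < n) ? in[i] : 0.0f;   // private: per-thread register
    tile[threadIdx.x] = v;
    __syncthreads();            // every thread in the block reaches this
    if (i < n)
        out[i] = tile[threadIdx.x] * coeffs[0];
}
```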

N-body, 100k points on ecetesla1 (L22)

Variant                                      Time
Sequential                                   40.3 s
CPU parallel                                 5.3 s
CUDA, one block per point, 1 thread each     9.5 s
CUDA, (N/256)+1 blocks of 256 threads        1.65 s

The one-block-per-point variant was slower than CPU parallel: one work-item per block wastes the warp hardware (a warp is 32 threads scheduled together, so 31 of the 32 lanes sit idle). The winning config computes idx = threadIdx.x + blockIdx.x * blockDim.x and groups threads into 256-wide blocks, sketched below.
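A sketch of that winning configuration (the kernel name and body are assumed, not the course's code):

```c
// 256-wide blocks, (N/256)+1 of them; the guard handles the last block
// overhanging N.
__global__ void nbody_step(float4 *pts, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;  // unique point index
    if (idx >= n) return;
    /* ... accumulate forces on pts[idx] from all other points ... */
}

// nbody_step<<<(N / 256) + 1, 256>>>(d_pts, N);
```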

Trading accuracy

Most gaming GeForce cards have little to no native FP64 hardware, so double precision runs at a small fraction of the FP32 rate (or is emulated). Dropping FP64 to FP32 is typically 16×, 32×, or 64× faster; FP32 to FP16 adds another ~2× [JeG14]. For rendering and many ML workloads the precision isn't needed. See Trading Accuracy for Time.
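A sketch of the FP32-to-FP16 trade using CUDA's cuda_fp16.h (the kernel name and scaling operation are illustrative): storage halves and, on hardware with native FP16, arithmetic throughput roughly doubles, at the cost of keeping only about three decimal digits of precision.

```c
#include <cuda_fp16.h>   // __half type and the float<->half conversions

// Store data in FP16; widen to FP32 to compute, narrow back to store.
__global__ void scale_half(__half *data, int n, float factor) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float v = __half2float(data[i]);
        data[i] = __float2half(v * factor);
    }
}
```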