GPU Programming
GPU programming runs a single function (a kernel) across many data points in parallel; NVIDIA calls this execution model SIMT (Single Instruction, Multiple Thread). This is the embarrassingly-parallel regime. Introduced in ECE459 L21/L22.
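As a minimal sketch (function names are illustrative, not from the course code), the same scaling operation written as a CPU loop and as a CUDA kernel: the loop body becomes the kernel, and the loop counter becomes the thread's ID.

```cuda
// Sequential version: one CPU core visits every element in turn.
void scale_cpu(float *a, float k, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= k;
}

// SIMT version: the loop disappears; one thread is launched per element,
// and each thread computes which element is its own.
__global__ void scale_gpu(float *a, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // the grid is usually rounded up past n
        a[i] *= k;
}
```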
Why use a GPU?
ecetesla2 has a 4-core 3.6 GHz CPU versus 1920 CUDA cores at 1.8 GHz. Half the per-core speed, 480× the workers, if the workload is suitable.
Drive vs fly analogy
Short trip (Ottawa to Montreal, 200 km): flying loses to driving because of airport overhead. Long trip (Waterloo to SF, 4000 km): flying wins. The GPU pays setup plus host-to-device transfer cost, so it’s only worth it with enough parallel work.
Index space
The kernel runs over an ND-Range, a 1D/2D/3D grid of work-items (each is a thread with a unique ID). Threads group into blocks; threads within a block share local memory, and the blocks themselves must be able to run in any order relative to one another.
Beyond three dimensions you fall back to explicit loops inside the kernel. 2D/3D data often flattens cleanly to a 1D index anyway, since a C [x][y] array is laid out linearly in memory. For CUDA syntax and block-configuration code, see CUDA.
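A minimal sketch, with made-up names, of a 2D index space flattened onto a row-major C array; the commented-out launch shows one possible block shape.

```cuda
// Trivial 2D kernel: each work-item handles one (x, y) cell and
// flattens it to the row-major offset a C array would use.
__global__ void copy_2d(const float *in, float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {
        int idx = y * width + x;                     // same cell as in[y][x]
        out[idx] = in[idx];
    }
}

// Launch with a 2D grid of 16x16 blocks covering the whole array:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// copy_2d<<<grid, block>>>(d_in, d_out, width, height);
```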
Memory
Vendor-neutral naming (OpenCL): private / local / global / constant / texture. CUDA’s equivalents plus sub-notes live in CUDA Memory.
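A sketch of where the CUDA keywords land in that hierarchy (names are illustrative, and it assumes 256-thread blocks); texture memory is accessed through separate texture-object APIs and is omitted here.

```cuda
__constant__ float coeff[4];                 // constant memory, written from the host

__global__ void apply(const float *in,       // in and out point into global memory
                      float *out, int n) {
    __shared__ float tile[256];              // local (shared) memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage global data into the block's tile
    __syncthreads();                             // every thread in the block reaches the barrier

    float acc = tile[threadIdx.x] * coeff[0];    // acc lives in private memory (a register)
    if (i < n)
        out[i] = acc;
}
```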
N-body, 100k points on ecetesla1 (L22)
| Variant | Time |
|---|---|
| Sequential | 40.3 s |
| CPU parallel | 5.3 s |
| CUDA, one block per point, 1 thread each | 9.5 s |
| CUDA, (N/256)+1 blocks of 256 threads | 1.65 s |
The third row was worse than CPU parallel: with one work-item per block, 31 of each 32-thread warp's lanes sit idle. The winning configuration computes `idx = threadIdx.x + blockIdx.x * blockDim.x` and groups threads into 256-wide blocks, as sketched below.
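A sketch of the two CUDA configurations from the table (the kernel name and signature are hypothetical; only the indexing and the launch shapes matter).

```cuda
__global__ void nbody_step(const float4 *pos, float4 *acc, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;               // the last block may extend past n
    // ... accumulate forces acting on point idx here ...
}

// Row 3 of the table: one block per point, one thread per block.
// Each 32-lane warp carries 31 idle lanes.
// nbody_step<<<N, 1>>>(d_pos, d_acc, N);

// Row 4: (N/256)+1 blocks of 256 threads, so warps run full.
// nbody_step<<<(N / 256) + 1, 256>>>(d_pos, d_acc, N);
```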
Trading accuracy
Most gaming GeForce cards have little or no native FP64 hardware, so double precision is emulated or heavily throttled. Dropping FP64 to FP32 is typically 16×, 32×, or 64× faster, and FP32 to FP16 adds another ~2× [JeG14]. For rendering and many ML workloads the extra precision isn't needed. See Trading Accuracy for Time.
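A sketch of what the trade looks like in code (hypothetical kernels; half-precision arithmetic needs cuda_fp16.h and a GPU of compute capability 5.3 or newer).

```cuda
#include <cuda_fp16.h>

// FP64: ~15-16 significant digits; slow on GeForce-class hardware.
__global__ void axpy_f64(const double *x, double *y, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// FP32: ~7 digits; the full-rate format on gaming cards.
__global__ void axpy_f32(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// FP16: ~3 digits; roughly another 2x on GPUs with double-rate half units.
__global__ void axpy_f16(const __half *x, __half *y, __half a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __hadd(__hmul(a, x[i]), y[i]);
}
```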