Heterogeneous Programming

Heterogeneous programming means writing code for systems that mix processor kinds, typically a CPU plus a GPU. Introduced in ECE459 L21. Examples include the PS3 Cell (a PowerPC core plus 8 SIMD coprocessors) [Ent08], CUDA, and OpenCL. The PS4 moved back to a CPU and GPU on a single AMD chip.

Why the split?

GPU cores are individually slower (~1.8 GHz on ecetesla2 vs ~3.6 GHz for its CPU), but there are far more of them (1920 CUDA cores). Offloading pays off when the parallel work outweighs the setup and data-transfer cost.

Programming model

The same shape works across Cell, CUDA, and OpenCL:

  1. Write the massively-parallel code (kernel) separately from the main code
  2. At runtime, set up the input
  3. Transfer data to the GPU
  4. Wait while the GPU runs the kernel
  5. Transfer results back
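In CUDA terms, the five steps above might look like the following host-side sketch. The kernel name `square` and the problem size are made up for illustration; the runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaDeviceSynchronize`) are the standard CUDA runtime API.

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Step 1: the massively-parallel kernel, written separately.
// Hypothetical example: square each element, one thread per element.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Step 2: at runtime, set up the input on the host.
    float *host = (float *) malloc(bytes);
    for (int i = 0; i < n; i++) host[i] = (float) i;

    // Step 3: transfer data to the GPU.
    float *dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // Step 4: run the kernel and wait for it to finish.
    square<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();

    // Step 5: transfer results back.
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    printf("host[3] = %f\n", host[3]);

    cudaFree(dev);
    free(host);
    return 0;
}
```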

Data parallelism is the central feature: evaluate the same kernel at every point of a set (the index space). CUDA also supports task parallelism (different kernels running in parallel, each over a one-point index space), but the course sticks to data parallelism.
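To sketch the index-space idea in CUDA terms: the launch configuration defines the index space, and each thread uses its coordinates to pick out its one point. The 2D matrix-add kernel below is illustrative, not from the course.

```
// Illustrative 2D index space: one thread per matrix element.
// The launch <<<grid, block>>> defines the index space; each
// thread computes its (row, col) coordinates within it.
__global__ void matrixAdd(const float *a, const float *b,
                          float *c, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        int i = row * cols + col;
        c[i] = a[i] + b[i];  // the kernel evaluated at one index point
    }
}

// Launched over a rows x cols index space, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((cols + 15) / 16, (rows + 15) / 16);
//   matrixAdd<<<grid, block>>>(d_a, d_b, d_c, rows, cols);
```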

See the drive-vs-fly analogy in GPU Programming for when offload pays off.