GPU Optimization

This note is written in my own words about how I think about optimizing GPU programs.

Some fundamental ideas behind optimization

  • Principle of Locality for GPUs / Data reuse
    • We want to keep data in registers as much as possible, for as long as possible, to minimize the overhead of slow memory transfers, i.e. load once, use many times
  • Latency Hiding
    • Memory copies often carry a large latency. We can frequently hide this latency behind computation by prefetching: issue the load for the next chunk of data while still computing on the current one
    • TODO: insert before/after profiling results
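To make the "load once, use many times" idea concrete, here is a minimal CUDA sketch (the kernel and all names are my own, not from this note): each thread pays for one global load of `x[i]`, then reuses that register value many times.

```cuda
// Sketch: data reuse from the register file (illustrative, names are mine).
// Each thread loads x[i] from global memory exactly once, then reuses the
// register copy `degree` times via Horner's rule instead of re-reading it.
__global__ void polyEval(const float* __restrict__ x,
                         const float* __restrict__ coeffs,  // small, cache-friendly
                         float* __restrict__ out,
                         int n, int degree) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];   // one slow global load, kept in a fast register
    float acc = coeffs[degree];
    for (int d = degree - 1; d >= 0; --d)
        acc = acc * xi + coeffs[d];   // xi reused, never re-fetched
    out[i] = acc;
}
```

The same principle scales up: tiled matrix multiply stages tiles through shared memory and registers so each global load feeds many multiply-adds.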
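And a sketch of latency hiding via software prefetching (again my own illustrative kernel, assuming a strided access pattern): the load for iteration k+1 is issued before the arithmetic for iteration k, so the memory latency is in flight while the compute happens.

```cuda
// Sketch: hiding load latency behind computation (names are mine).
// `next` is loaded one iteration ahead; the hardware can overlap that
// in-flight load with the arithmetic on `cur`.
__global__ void prefetchSum(const float* __restrict__ in,
                            float* __restrict__ out,
                            int n, int steps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i >= n) return;

    float acc = 0.0f;
    float next = in[i];                 // prefetch for iteration 0
    for (int k = 0; k < steps; ++k) {
        float cur = next;
        if (k + 1 < steps)
            next = in[(i + (k + 1) * stride) % n];  // issue load ahead of use
        acc += cur * cur + 0.5f * cur;  // compute while the load is in flight
    }
    out[i] = acc;
}
```

The same idea underlies shared-memory double buffering: compute on one tile while the next tile is being loaded.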

The 3 types of bottlenecks

  • Memory bottleneck: When memory copies are the main bottleneck of the program. The GPU is busy moving data around and spends very little time actually doing computation. This is bad because the compute units sit mostly idle; we could be doing more computation per byte moved.
  • Compute bottleneck: Most of the time is spent doing computation. This is actually generally good; it means you are already in a good state. But you may still be able to optimize the computation further, e.g. by profiling with NVIDIA Nsight Compute.
  • Kernel overhead: Here, we’re referring to the fixed cost of launching a kernel (typically a few microseconds per launch). It dominates when the program launches many small, short-running kernels.
  1. Memory bottleneck (within GPU):

  2. Compute Bottleneck:

  3. Kernel Overhead:

  • a visualization of kernel launching
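To illustrate the kernel-overhead case, here is a host-side CUDA sketch (my own illustrative kernels, not from this note): the slow version pays the launch overhead once per tiny kernel, while the fused version does the same total work in one launch.

```cuda
// Sketch: launch overhead vs. fused work (names and counts are mine).
__global__ void addOnce(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void addMany(float* x, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < iters; ++k) x[i] += 1.0f;
}

void runSlow(float* d_x, int n) {
    // 10,000 launches: the few-microsecond launch cost is paid 10,000 times,
    // which can exceed the kernel's actual runtime.
    for (int k = 0; k < 10000; ++k)
        addOnce<<<(n + 255) / 256, 256>>>(d_x, n);
}

void runFast(float* d_x, int n) {
    // One launch doing the same total work: overhead paid once.
    addMany<<<(n + 255) / 256, 256>>>(d_x, n, 10000);
}
```

When kernels cannot be fused, CUDA Graphs can also amortize launch overhead by replaying a pre-recorded sequence of launches.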
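A quick back-of-the-envelope way to tell memory-bound from compute-bound is arithmetic intensity, roofline-style. The numbers below are illustrative, not measurements from this note:

```
AI = FLOPs performed / bytes moved

Example: SAXPY, y[i] = a * x[i] + y[i]
  FLOPs per element:  2   (one multiply, one add)
  Bytes per element:  12  (load x, load y, store y; 4 bytes each)
  AI = 2 / 12 ≈ 0.17 FLOP/byte

A GPU with, say, 10 TFLOP/s of compute and 1 TB/s of bandwidth needs
AI ≈ 10 FLOP/byte to keep the compute units busy, so SAXPY is firmly
memory-bound on such a device.
```

If measured AI is far below the hardware balance point, attack the memory bottleneck (reuse, coalescing); if it is above, you are compute-bound.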

Other

Use the Jupyter Nsight integration to get really good at profiling your CUDA programs.