GPU Optimization
This note is written in my own words about how I think about optimizing GPU programs.

- Source: Jane Street ML Talk
Some fundamental ideas behind optimization
- Principle of Locality for GPUs / Data reuse
- We want to keep data in registers as much as possible, for as long as possible, to minimize the overhead of slow memory transfers, i.e. load once, use many times
- Latency Hiding
- Memory copies often carry a large overhead. We can frequently hide this latency behind computation by prefetching: start the next transfer while the current data is being computed on
- TODO: insert before/after profiles
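The load-once, use-many-times idea can be sketched as a small kernel (a hypothetical example, not from the talk; all names are made up):

```cuda
// Scales each row of an n_rows x n_cols matrix by a per-row factor.
// The factor is loaded from global memory once into a register and
// then reused across the whole row, instead of being re-read on
// every iteration.
__global__ void scale_rows(const float* factors, float* mat,
                           int n_rows, int n_cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float f = factors[row];            // one global load, kept in a register
    for (int col = 0; col < n_cols; ++col) {
        mat[row * n_cols + col] *= f;  // reused n_cols times
    }
}
```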
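The latency-hiding point can be sketched with CUDA streams: the copy for one chunk overlaps with the kernel working on another chunk. This is a sketch under assumptions: `process` is a hypothetical kernel defined elsewhere, and `h_buf` is assumed to be pinned host memory (allocated with `cudaMallocHost`) so the copies can actually run asynchronously.

```cuda
// Overlap host-to-device copies with computation using two streams:
// while chunk i is being computed, the copy of chunk i+1 is in flight.
void run_chunked(const float* h_buf, float* d_buf,
                 int n_chunks, int chunk_elems) {
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);
    for (int i = 0; i < n_chunks; ++i) {
        cudaStream_t s = streams[i % 2];
        float* d_chunk = d_buf + (size_t)i * chunk_elems;
        // Async copy of chunk i; returns immediately, queued on stream s.
        cudaMemcpyAsync(d_chunk, h_buf + (size_t)i * chunk_elems,
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        // The kernel waits for its own copy (same stream), but can
        // overlap with the copy queued on the other stream.
        process<<<(chunk_elems + 255) / 256, 256, 0, s>>>(d_chunk,
                                                          chunk_elems);
    }
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```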
The 3 types of bottlenecks
- Memory bottleneck: When memory copies are the main bottleneck of the program. The GPU is busy moving data around and spends very little time actually doing computation. This is bad because the compute units sit idle when they could be doing useful work.
- Compute bottleneck: Most of the time is spent doing computation. This is actually generally good; it means you are already in a good state. But maybe you can optimize computation even more. Maybe debug with NVIDIA Nsight Compute.
- Kernel overhead: Here, we’re referring to the launch overhead — the fixed cost of launching each kernel. If a program launches many tiny kernels, launch latency can dominate.
Memory bottleneck (within GPU):
Compute Bottleneck:
Kernel Overhead:
- Kernel Fusion
- Leverage CUDA Graph
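Kernel fusion in a nutshell, as a hypothetical elementwise example (the fused version also saves a round trip of the intermediate through global memory, so it helps memory-bound cases too):

```cuda
// Unfused: two launches, and `tmp` makes a round trip through
// global memory between them.
__global__ void add(const float* a, const float* b, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a[i] + b[i];
}
__global__ void relu(const float* tmp, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(tmp[i], 0.f);
}

// Fused: one launch, and the intermediate stays in a register.
__global__ void add_relu(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(a[i] + b[i], 0.f);
}
```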
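A minimal CUDA Graphs sketch using stream capture (hypothetical kernels `step_a`/`step_b` and launch configuration `grid`/`block` assumed defined elsewhere): record a short sequence of launches once, then replay it with a single `cudaGraphLaunch` per iteration, amortizing the per-kernel launch overhead.

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaGraphExec_t graph_exec;

// Capture the launch sequence into a graph instead of executing it.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
step_a<<<grid, block, 0, stream>>>(d_x, n);
step_b<<<grid, block, 0, stream>>>(d_x, n);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

// Replay many times: one launch call per iteration instead of two.
for (int iter = 0; iter < 1000; ++iter) {
    cudaGraphLaunch(graph_exec, stream);
}
cudaStreamSynchronize(stream);
```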

- TODO: add a visualization of kernel launching
Other
Use the Jupyter Nsight integration to really get good at profiling your CUDA programs.