GPU Optimization

GPU optimization is the craft of pushing a CUDA workload closer to peak hardware throughput by identifying which of memory, compute, or kernel overhead is the current bottleneck, then applying the technique that attacks that bottleneck. Domain-specific stacks like LLM Optimization are really specialized applications of this taxonomy.

Why does this taxonomy matter?

A naive C implementation of a 2048×2048 matmul on an Apple M2 hits ~0.005% of peak. Hand-vectorized and cache-blocked across all CPU cores: ~1%. The Neural Engine: ~80%. That is a factor of roughly 16,000, about four orders of magnitude, on the same chip (MIT 6.S894 Lec 1, slides 13-23). Every rung you climb is a memory, compute, or kernel-overhead win, and you can’t reason about which one to chase without knowing where the workload currently sits. See MIT 6.S894 for the framing.

Guiding principles

The 3 bottlenecks

  • Memory: the GPU spends most of its time moving data (lots of HBM ↔ SRAM traffic relative to FLOPs) and very little on compute. Attack with locality, e.g. tiling
  • Compute: most of the time goes to actual computation. Generally a good state to be in, but it may still be pushable with Tensor Cores or better algorithms. Profile with Nsight Compute
  • Kernel overhead: launch overhead dominates, especially with many short kernels. Attack with fusion or CUDA Graphs (a fusion sketch follows the figure below)

  (Figure: timeline visualization of kernel launches under each case: 1. memory bottleneck within the GPU, 2. compute bottleneck, 3. kernel overhead.)
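As a concrete illustration of attacking kernel overhead (and redundant HBM traffic), here is a minimal sketch of kernel fusion: two elementwise kernels collapsed into one, so the intermediate never round-trips through global memory and only one launch is paid. The kernel and variable names are illustrative assumptions, not taken from the course material.

```cuda
#include <cuda_runtime.h>

// Unfused version: two launches, and `tmp` round-trips through global memory.
__global__ void scale_kernel(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}
__global__ void add_kernel(const float* tmp, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];
}

// Fused version: one launch, the intermediate value stays in a register.
__global__ void scale_add_fused(const float* x, const float* y, float* out,
                                float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];  // no global-memory trip for a*x
}

void launch_fused(const float* x, const float* y, float* out, float a, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_add_fused<<<blocks, threads>>>(x, y, out, a, n);
}
```

Besides saving one launch, fusion here avoids writing and re-reading n intermediate floats, which also helps when the workload is memory-bound.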

More techniques inside LLM Optimization.

CUDA optimization checklist (PMPP Table 6.1)

A list of techniques taken from the PMPP book:

  • Maximize occupancy: run enough threads per SM to hide latency
  • Enable memory coalescing: be aware of global-memory (DRAM) access order so threads in a warp touch consecutive addresses
  • Minimize control divergence within warps
  • Tiling: stage reused data in shared memory (see the tiled matmul sketch after this list)
  • Privatization: each thread works on its own private copy, then reduce at the end
  • Thread coarsening: do more work per thread to amortize per-thread overhead
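To make the tiling and coalescing items concrete, here is a minimal sketch of the standard shared-memory tiled matmul pattern the book describes (the tile width is a tunable assumption, not a value mandated by PMPP): each block cooperatively loads a TILE×TILE block of A and B, so every global element is read once per tile instead of once per output element, and the loads indexed by threadIdx.x are coalesced.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile width; a tunable assumption

// C = A * B for square n×n row-major matrices.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        // Each loaded element is reused TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}
```

Thread coarsening would extend this by having each thread compute several output elements, reusing the same tile of A across them.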

Profiling

Use Nsight Compute and Nsight Systems to see which bottleneck you’re actually hitting before picking a technique.
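A minimal sketch of typical invocations, assuming the binary is ./app (paths and report names are placeholders): Nsight Systems gives the timeline (launch gaps, many short kernels → kernel overhead), Nsight Compute gives per-kernel counters (memory vs. compute throughput).

```bash
# Timeline view: spot launch gaps and short kernels (kernel-overhead bottleneck)
nsys profile -o timeline ./app

# Per-kernel counters: compare memory vs. compute throughput (memory vs. compute bound)
ncu -o kernels --set full ./app
```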