CUDA Graph

CUDA Graphs let you define work as a graph of operations rather than as a series of individually launched operations.

Resources

Why do you need CUDA graphs when you have torch.compile? They do different things: torch.compile mainly fuses operations into fewer kernels, while CUDA graphs eliminate the CPU launch overhead of the kernels that remain.

So after torch.compile, the CPU might still launch each kernel separately:

CPU launches fused_kernel
CPU launches matmul_kernel
CPU launches fused_kernel_2
CPU launches allreduce

A CUDA graph turns that into:

CPU launches the entire recorded step with a single graph launch
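
As a rough illustration of the record-then-replay idea above, here is a minimal sketch using PyTorch's `torch.cuda.CUDAGraph` and the `torch.cuda.graph` context manager. The `step` function, the `build_graphed_step` helper, and the tensor shapes are hypothetical stand-ins chosen for this example, not part of any real training loop, and the code only does work when a CUDA device is available.

```python
import torch

def step(x, w):
    # Hypothetical stand-in for one "step": a few small kernels (matmul, add, relu).
    return torch.relu(x @ w + 1.0)

def build_graphed_step(static_x, w, warmup_iters=3):
    """Capture step() into a CUDA graph and return (graph, static_out)."""
    # Warm up on a side stream so one-time lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup_iters):
            step(static_x, w)
    torch.cuda.current_stream().wait_stream(side)

    # Record the kernels launched inside the block instead of running them eagerly.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = step(static_x, w)
    return graph, static_out

if torch.cuda.is_available():
    w = torch.randn(64, 64, device="cuda")
    static_x = torch.zeros(8, 64, device="cuda")  # fixed input buffer used by the graph
    graph, static_out = build_graphed_step(static_x, w)

    new_x = torch.randn(8, 64, device="cuda")
    static_x.copy_(new_x)     # refill the captured input buffer in place
    graph.replay()            # one launch replays every captured kernel
    torch.cuda.synchronize()
    # Compare the replayed result against an ordinary eager run of the same step.
    print(torch.allclose(static_out, step(new_x, w), atol=1e-5))
```

The graph replays the same kernels on the same memory addresses, which is why new inputs are copied into `static_x` rather than passed as fresh tensors, and why results are read back from `static_out`.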