CUDA Graph
CUDA Graphs let you define GPU work as a graph of operations rather than as single operations launched one at a time.
Resources
- https://developer.nvidia.com/blog/cuda-graphs/
- https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
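Why do you need CUDA graphs when you have torch.compile? They do different things: torch.compile fuses operations into fewer, faster kernels, while CUDA graphs cut the CPU-side overhead of launching the kernels that remain.
So after compilation, each step might still involve several separate launches: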
CPU launches fused_kernel
CPU launches matmul_kernel
CPU launches fused_kernel_2
CPU launches allreduce
A CUDA graph turns that into:
CPU launches the entire recorded step once
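The capture-then-replay pattern above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not a recommended training setup: the function name `graphed_step_demo`, the tiny model, and the tensor shapes are hypothetical stand-ins, and the function simply returns None when PyTorch or a CUDA device is not available. It relies on `torch.cuda.CUDAGraph`, the `torch.cuda.graph` context manager, and `CUDAGraph.replay()`.

```python
try:
    import torch
except ImportError:  # allows the sketch to be imported on a machine without PyTorch
    torch = None


def graphed_step_demo(num_replays=3):
    """Capture one forward step into a CUDA graph, then replay it.

    Returns a list of booleans (replayed output matches eager output),
    or None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    # Hypothetical stand-in for the real step (fused kernels, matmuls, ...).
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
    ).cuda().eval()

    # A graph replays fixed memory addresses, so the input lives in a
    # static buffer that is overwritten in place before every replay.
    static_input = torch.randn(16, 64, device="cuda")

    # Warm up on a side stream so lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: kernels are recorded into the graph, not executed normally.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    results = []
    for _ in range(num_replays):
        new_batch = torch.randn(16, 64, device="cuda")
        static_input.copy_(new_batch)  # refill the static input buffer
        graph.replay()                 # one CPU call launches the whole step
        with torch.no_grad():
            expected = model(new_batch)  # eager reference: one launch per kernel
        results.append(torch.allclose(static_output, expected, atol=1e-4))
    return results


if __name__ == "__main__":
    print(graphed_step_demo())
```

Two constraints follow from this design: the captured step must use the same shapes and the same tensors on every replay, and any new data has to be copied into the static buffers rather than passed as fresh tensors.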