Torch Distributed
Also see Collective Operations.


- Inside a node → NCCL uses NVLink
- Across nodes → NCCL uses RDMA / InfiniBand
Fundamental concepts:
import torch.distributed as dist

You should just read this guide: https://lambda.ai/blog/multi-node-pytorch-distributed-training-guide
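A minimal sketch of those concepts (rank, local rank, world size, process group), assuming the script is launched with torchrun so the environment variables below are set; not taken from the guide:

```python
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets these environment variables for every process it spawns.
    rank = int(os.environ["RANK"])              # global rank across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

    # NCCL backend: NVLink inside a node, RDMA/IB across nodes.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    print(f"rank {rank} of {world_size} (local rank {local_rank}) is up")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```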
An example launch command (2 nodes, 8 processes/GPUs per node):
```bash
torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --rdzv_id=100 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29400 \
    elastic_ddp.py
```
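elastic_ddp.py is not reproduced in these notes; a minimal sketch of what such a script typically contains (the Linear model, random data, and hyperparameters are placeholders, not the guide's actual code):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = nn.Linear(10, 10).to(device)             # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])  # handles gradient sync

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(10):                              # placeholder training loop
        optimizer.zero_grad()
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 10, device=device)
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                              # gradients are all-reduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```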
DDP under the hood
After the backward pass, every rank ends up with the averaged gradient: grad = sum(grad over all ranks) / world_size
- This is just an AllReduce operation!
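A sketch of that averaging done by hand (assuming the process group is already initialized and `model` is a plain, non-DDP module). Real DDP buckets gradients and overlaps the communication with the backward pass, but the math is the same:

```python
import torch.distributed as dist

def average_gradients(model):
    # Reproduce DDP's gradient sync manually: AllReduce (sum) each gradient,
    # then divide by the number of ranks to get the mean.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```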