Torch Distributed

Also see Collective Operations.

Inside a node → NCCL uses NVLink/NVSwitch (falls back to PCIe / shared memory)
Across nodes → NCCL uses RDMA over InfiniBand or RoCE (falls back to TCP sockets)
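
To see which transport NCCL actually picks, turn on its debug logging before the process group is created. A minimal sketch, assuming the script is launched under torchrun so the rendezvous env vars are already set:

import os
import torch.distributed as dist

# NCCL logs the chosen transport (NVLink, P2P, NET/IB, Socket, ...) at INFO level.
# The variable must be set before init_process_group() creates the NCCL communicator.
os.environ["NCCL_DEBUG"] = "INFO"
dist.init_process_group(backend="nccl")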

Fundamental concepts:

- world size: total number of processes in the job (nnodes * nproc_per_node)
- rank: global index of a process, from 0 to world_size - 1
- local rank: index of a process within its own node, used to pick its GPU
- process group: the set of processes that communicate with each other
- backend: the communication library (nccl for GPUs, gloo for CPUs)

import torch.distributed as dist
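
A minimal sketch of how these concepts show up in code, assuming the script is launched with torchrun (which exports the env vars read below):

import os
import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT,
# so init_process_group() can read everything from the environment.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()                      # global index: 0 .. world_size - 1
world_size = dist.get_world_size()          # total number of processes
local_rank = int(os.environ["LOCAL_RANK"])  # index within this node
torch.cuda.set_device(local_rank)           # one process per GPU

print(f"rank {rank} of {world_size}, local rank {local_rank}")
dist.destroy_process_group()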

You should just read this guide: https://lambda.ai/blog/multi-node-pytorch-distributed-training-guide

An example launch command, run on every node (2 nodes x 8 GPUs per node = 16 processes):

torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_id=100 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29400 \
  elastic_ddp.py
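
The elastic_ddp.py name comes from the command above; a minimal sketch of what such a DDP training script could look like (the model and data here are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world size via env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and random data; a real script would also use a
    # DistributedSampler so each rank sees a different shard of the dataset.
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 10, device=local_rank)
        y = torch.randn(32, 1, device=local_rank)
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()  # backward() triggers the gradient all-reduce
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()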

DDP under the hood

During backward(), DDP all-reduces gradients in buckets across ranks (overlapping communication with the rest of the backward pass), so every rank ends up with the same averaged gradient:

grad = sum(grad over all ranks) / world_size
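
The same averaging step done by hand with the collective API, as a conceptual sketch (not DDP's actual bucketed, overlapped implementation):

import torch.distributed as dist

def average_gradients(model):
    # What DDP does conceptually after the local backward pass:
    # sum each gradient across all ranks, then divide by world_size.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size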