Tensor Parallelism
Tensor parallelism is a technique used to fit a large model across multiple GPUs by sharding each weight matrix (and its matmul) among them.
https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism
Split on the Dim axis (CS231n 2025 Lec 11)
A Transformer operates on activation tensors of shape (Batch, Seq, Dim). The four parallelism dimensions split different axes: data parallelism splits Batch, sequence (context) parallelism splits Seq, tensor parallelism splits Dim, and pipeline parallelism splits the model across layers.
For a dense layer Y = XW with X of shape N x D and W of shape D x M, shard W column-wise across G GPUs: W = [W_1 | W_2 | ... | W_G] with W_i of shape D x (M/G). Broadcast X to every GPU. GPU i computes Y_i = X W_i, its shard of the output. Block shapes: X is N x D, W_i is D x (M/4), Y_i is N x (M/4) (for 4-way TP).
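The column sharding above can be simulated on one machine with numpy; the shard count G and the shapes are illustrative, not from the lecture:

```python
import numpy as np

# Sketch: 4-way tensor parallelism for Y = X @ W by column-sharding W.
# Shapes follow the text: X is N x D, each shard W_i is D x (M/G).
N, D, M, G = 8, 16, 32, 4            # G = number of (simulated) GPUs

rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))      # broadcast to every GPU
W = rng.standard_normal((D, M))

# Column sharding: GPU i holds W_i = W[:, i*(M//G):(i+1)*(M//G)]
W_shards = np.split(W, G, axis=1)

# Each GPU computes its output shard independently, no communication needed.
Y_shards = [X @ W_i for W_i in W_shards]   # each is N x (M/G)

# Gathering the shards (an all-gather) reproduces the unsharded result.
Y = np.concatenate(Y_shards, axis=1)
assert np.allclose(Y, X @ W)
```

On real hardware the concatenation is an all-gather collective; here it is just `np.concatenate`.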
The two-layer trick (no intermediate comm). If layer 1 outputs Y = XW and layer 2 computes Z = YU, the naïve approach gathers Y from all GPUs and then broadcasts it. Instead: shard W column-wise (each GPU i owns W_i) and shard U row-wise (each GPU i owns U_i). Then

Z = YU = [Y_1 | ... | Y_G] [U_1; ...; U_G] = sum_i Y_i U_i = sum_i (X W_i) U_i.
Each GPU computes one partial product (X W_i) U_i locally; no communication is needed after Y_i = X W_i. The partial sums are combined with a single all_reduce at the end. This is exactly the pattern used in the MLP block of a Transformer (up-proj column-sharded, down-proj row-sharded) and in attention (QKV projections column-sharded, output projection row-sharded).
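The two-layer trick can also be checked with numpy; the hidden size H and shard count G are made-up illustration values, and a plain `sum` stands in for the all_reduce:

```python
import numpy as np

# Sketch of the two-layer trick: column-shard W, row-shard U, and combine
# the local partial products with a single sum (the all_reduce).
N, D, H, G = 8, 16, 64, 4            # hidden size H, G simulated GPUs

rng = np.random.default_rng(1)
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, H))      # up-proj, column-sharded
U = rng.standard_normal((H, D))      # down-proj, row-sharded

W_shards = np.split(W, G, axis=1)    # W_i: D x (H/G)
U_shards = np.split(U, G, axis=0)    # U_i: (H/G) x D

# Each GPU computes (X @ W_i) @ U_i locally; no comm between the two layers.
partials = [(X @ W_i) @ U_i for W_i, U_i in zip(W_shards, U_shards)]

# One all_reduce (here: a plain sum over shards) combines the partial sums.
Z = sum(partials)
assert np.allclose(Z, X @ W @ U)
```

Note that the intermediate Y = XW of shape N x H is never materialized on any single GPU; each one only ever holds an N x (H/G) slice.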
TP is typically capped at the size of a single NVLink domain (8 GPUs on an H100 DGX), since each block still requires an all-reduce.
Source
CS231n 2025 Lec 11 slides ~134-143 (TP as Dim split, column sharding block diagram, 4-way TP two-layer column/row trick, "no need for communication after XW=Y"). 2026 PDF not published; using 2025 fallback.