Model Parallel

Tensor Parallelism

Tensor parallelism is a technique used to fit a large model across multiple GPUs by sharding its weight matrices.

https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism

Split on the Dim axis (CS231n 2025 Lec 11)

A Transformer applies a stack of layers to activation tensors of shape $(\text{Batch}, \text{Seq}, \text{Dim})$. The four parallelism schemes split different axes:

| Scheme | Axis |
| --- | --- |
| Data Parallel (DP) | Batch |
| Context Parallel (CP) | Sequence |
| Pipeline Parallel (PP) | Layer |
| Tensor Parallel (TP) | Dim |
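
As a minimal PyTorch-style sketch (not from the slides; the sizes and variable names are illustrative), splitting the same activation tensor along each of these axes:

```python
import torch

# Toy activation tensor of shape (Batch, Seq, Dim); sizes are arbitrary.
x = torch.randn(8, 1024, 512)
n_devices = 4

dp_shards = torch.chunk(x, n_devices, dim=0)  # DP: split the batch axis
cp_shards = torch.chunk(x, n_devices, dim=1)  # CP: split the sequence axis
tp_shards = torch.chunk(x, n_devices, dim=2)  # TP: split the hidden Dim axis
# PP splits the model along depth (layers), not an axis of this activation tensor.

print(dp_shards[0].shape, cp_shards[0].shape, tp_shards[0].shape)
# torch.Size([2, 1024, 512]) torch.Size([8, 256, 512]) torch.Size([8, 1024, 128])
```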

For a dense layer $Y = XW$ with $W \in \mathbb{R}^{D \times D}$, shard $W$ column-wise across $N$ GPUs: $W = [\,W_1 \mid W_2 \mid \dots \mid W_N\,]$ with $W_i \in \mathbb{R}^{D \times D/N}$. Broadcast $X$ to every GPU. GPU $i$ computes $Y_i = X W_i$, its shard of the output. Block shapes for 4-way TP: $X$ is $B \times D$, each $W_i$ is $D \times D/4$, each $Y_i$ is $B \times D/4$.
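
A small single-process sketch of the column sharding (illustrative only; the four ranks are simulated with a Python list rather than real GPUs):

```python
import torch

torch.manual_seed(0)
B, D, N = 2, 512, 4              # batch, hidden dim, number of TP ranks (4-way TP)

X = torch.randn(B, D)            # broadcast: every rank holds the full X
W = torch.randn(D, D)
W_shards = torch.chunk(W, N, dim=1)   # column-wise shards, each W_i is (D, D/N)

# Rank i computes its output shard Y_i = X @ W_i of shape (B, D/N).
Y_shards = [X @ W_i for W_i in W_shards]

# Concatenating the shards along the column axis reproduces the full Y = X @ W.
assert torch.allclose(torch.cat(Y_shards, dim=1), X @ W, rtol=1e-4, atol=1e-5)
```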

The two-layer trick (no intermediate comm). If layer 1 outputs $Y = XW$ and layer 2 computes $Z = YU$, the naive approach gathers $Y$ from all GPUs and then broadcasts it. Instead: shard $W$ column-wise (each GPU $i$ owns $W_i$) and shard $U$ row-wise (each GPU $i$ owns $U_i$). Then

$$Z = YU = \sum_i Y_i U_i, \qquad Y_i = X W_i.$$

Each GPU computes one partial product $Y_i U_i$ locally, with no communication after $Y_i = X W_i$. The partial sums are combined with a single all_reduce at the end. This is exactly the pattern used in the MLP block of a Transformer (up-proj column-sharded, down-proj row-sharded) and in the QKV→out pattern of attention.
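
A matching sketch of the two-layer trick, again simulating the ranks in one process; the final sum stands in for the all_reduce, and `W`, `U`, and the sizes are made-up placeholders:

```python
import torch

torch.manual_seed(0)
B, D, H, N = 2, 64, 256, 4       # batch, model dim, hidden dim, TP ranks

X = torch.randn(B, D)
W = torch.randn(D, H)            # up-proj:   sharded column-wise
U = torch.randn(H, D)            # down-proj: sharded row-wise

W_cols = torch.chunk(W, N, dim=1)   # W_i: (D, H/N), owned by rank i
U_rows = torch.chunk(U, N, dim=0)   # U_i: (H/N, D), owned by rank i

# Rank i computes Y_i = X @ W_i and then Z_i = Y_i @ U_i entirely locally.
partials = [(X @ W_i) @ U_i for W_i, U_i in zip(W_cols, U_rows)]

# The only communication: one all-reduce (sum) over the partial Z_i's.
Z = torch.stack(partials).sum(dim=0)
assert torch.allclose(Z, (X @ W) @ U, rtol=1e-4, atol=1e-4)
```

In a real multi-GPU run, the stack-and-sum would be a single torch.distributed.all_reduce over each rank's local `Z_i`.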

TP is typically capped at the size of a single NVLink domain (8 GPUs on an H100 DGX), since each block still requires an all-reduce.

Source

CS231n 2025 Lec 11 slides ~134–143 (TP as Dim split, column sharding block diagram, 4-way TP two-layer column/row trick, "no need for communication after XW=Y"). 2026 PDF not published; using 2025 fallback.