Model Parallel

Tensor Parallelism

Tensor parallelism is a technique used to fit a large model across multiple GPUs by sharding its weight matrices.

https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism

Split on the Dim axis (CS231n 2025 Lec 11)

A Transformer applies a stack of layers to activation tensors of shape $(\text{Batch}, \text{Seq}, \text{Dim})$. The four parallelism schemes split different axes:

| Scheme | Axis |
| --- | --- |
| Data Parallel (DP) | Batch |
| Context Parallel (CP) | Sequence |
| Pipeline Parallel (PP) | Layer |
| Tensor Parallel (TP) | Dim |
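
As a minimal PyTorch-style sketch (not from the slides; the sizes and variable names are illustrative), splitting the same activation tensor along each of these axes:

```python
import torch

# Toy activation tensor of shape (Batch, Seq, Dim); sizes are arbitrary.
x = torch.randn(8, 1024, 512)
n_devices = 4

dp_shards = torch.chunk(x, n_devices, dim=0)  # DP: split the batch axis
cp_shards = torch.chunk(x, n_devices, dim=1)  # CP: split the sequence axis
tp_shards = torch.chunk(x, n_devices, dim=2)  # TP: split the hidden Dim axis
# PP splits the model along depth (layers), not an axis of this activation tensor.

print(dp_shards[0].shape, cp_shards[0].shape, tp_shards[0].shape)
# torch.Size([2, 1024, 512]) torch.Size([8, 256, 512]) torch.Size([8, 1024, 128])
```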

For a dense layer $Y = XW$ with $W \in \mathbb{R}^{D \times D}$, shard $W$ column-wise across $N$ GPUs: $W = [\,W_1 \mid W_2 \mid \dots \mid W_N\,]$ with $W_i \in \mathbb{R}^{D \times D/N}$. Broadcast $X$ to every GPU. GPU $i$ computes $Y_i = X W_i$, its shard of the output. Block shapes for 4-way TP: $X$ is $B \times D$, each $W_i$ is $D \times D/4$, each $Y_i$ is $B \times D/4$.
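
A small single-process sketch of the column sharding (illustrative only; the four ranks are simulated with a Python list rather than real GPUs):

```python
import torch

torch.manual_seed(0)
B, D, N = 2, 512, 4              # batch, hidden dim, number of TP ranks (4-way TP)

X = torch.randn(B, D)            # broadcast: every rank holds the full X
W = torch.randn(D, D)
W_shards = torch.chunk(W, N, dim=1)   # column-wise shards, each W_i is (D, D/N)

# Rank i computes its output shard Y_i = X @ W_i of shape (B, D/N).
Y_shards = [X @ W_i for W_i in W_shards]

# Concatenating the shards along the column axis reproduces the full Y = X @ W.
assert torch.allclose(torch.cat(Y_shards, dim=1), X @ W, rtol=1e-4, atol=1e-5)
```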

The two-layer trick (no intermediate comm). If layer 1 outputs $Y = XW$ and layer 2 computes $Z = YU$, the naive approach gathers $Y$ from all GPUs and then broadcasts it. Instead: shard $W$ column-wise (each GPU $i$ owns $W_i$) and shard $U$ row-wise (each GPU $i$ owns $U_i$). Then

$$Z = YU = \sum_i Y_i U_i, \qquad Y_i = X W_i.$$

Each GPU computes one partial product $Y_i U_i$ locally, with no communication after $Y_i = X W_i$. The partial sums are combined with a single all_reduce at the end. This is exactly the pattern used in the MLP block of a Transformer (up-proj column-sharded, down-proj row-sharded) and in the QKV→out pattern of attention.
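
A matching sketch of the two-layer trick, again simulating the ranks in one process; the final sum stands in for the all_reduce, and `W`, `U`, and the sizes are made-up placeholders:

```python
import torch

torch.manual_seed(0)
B, D, H, N = 2, 64, 256, 4       # batch, model dim, hidden dim, TP ranks

X = torch.randn(B, D)
W = torch.randn(D, H)            # up-proj:   sharded column-wise
U = torch.randn(H, D)            # down-proj: sharded row-wise

W_cols = torch.chunk(W, N, dim=1)   # W_i: (D, H/N), owned by rank i
U_rows = torch.chunk(U, N, dim=0)   # U_i: (H/N, D), owned by rank i

# Rank i computes Y_i = X @ W_i and then Z_i = Y_i @ U_i entirely locally.
partials = [(X @ W_i) @ U_i for W_i, U_i in zip(W_cols, U_rows)]

# The only communication: one all-reduce (sum) over the partial Z_i's.
Z = torch.stack(partials).sum(dim=0)
assert torch.allclose(Z, (X @ W) @ U, rtol=1e-4, atol=1e-4)
```

In a real multi-GPU run, the stack-and-sum would be a single torch.distributed.all_reduce over each rank's local `Z_i`.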

TP is typically capped at the size of a single NVLink domain (8 GPUs on an H100 DGX), since each block still requires an all-reduce.

Source

CS231n 2025 Lec 11 slides ~134–143 (TP as Dim split, column sharding block diagram, 4-way TP two-layer column/row trick, "no need for communication after XW=Y"). 2026 PDF not published; using 2025 fallback.