Model Parallel

Pipeline Parallelism

Using pipelining across multiple GPUs for training.

https://colossalai.org/docs/concepts/paradigms_of_parallelism/

GPipe paper: Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," NeurIPS 2019.

CS231n 2025 Lec 11 — bubble problem + microbatches

Split the layers across G GPUs so that GPU i holds a contiguous block of layers (stage i). Forward: activations flow GPU 0 → 1 → … → G − 1. Backward: gradients flow back in reverse. Bubble problem: at any instant only one GPU is active; the rest wait on their upstream or downstream neighbor. Max MFU = 1/G.
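A minimal sketch of the naive schedule above (names and layout are illustrative, not from the lecture): one batch walks forward through G stages and back, so at every timestep exactly one GPU is busy.

```python
# Toy timeline for naive model parallelism: one batch, G stages,
# forward then backward. At every step exactly one GPU works.
G = 4
timeline = []  # (step, active_gpu, phase)
order = list(range(G)) + list(range(G - 1, -1, -1))  # fwd 0..G-1, bwd G-1..0
for step, gpu in enumerate(order):
    phase = "fwd" if step < G else "bwd"
    timeline.append((step, gpu, phase))

for step, gpu, phase in timeline:
    row = ["."] * G
    row[gpu] = "F" if phase == "fwd" else "B"
    print(f"t={step}: " + " ".join(row))

# Each GPU is busy for 2 of the 2*G steps -> utilization 2/(2*G) = 1/G (25% here).
```

Printing the rows reproduces the bubble diagram from the slides: a diagonal of F's down, a diagonal of B's back up, idle dots everywhere else.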

Microbatch fix (Huang et al., GPipe, NeurIPS 2019): split the batch into M microbatches and pipeline them. As soon as microbatch 1 finishes on GPU 0, start microbatch 2 there. The forward pass fills the pipeline; the backward pass drains it. With G stages and M microbatches, active-time fraction = 2M / (2(M + G − 1)) = M / (M + G − 1).

Worked example from the lecture: 4-way PP with 4 microbatches → 4 / (4 + 4 − 1) = 4/7 ≈ 57.1% active time. The bubble at each end is unavoidable; you amortize it by making M ≫ G.
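The active-time formula and the worked example can be checked with a few lines (a sketch; the function name is mine):

```python
def active_fraction(G, M):
    # Each stage runs M forward + M backward microbatch slots (2*M active),
    # while the pipeline needs (M + G - 1) slots per direction to fill/drain.
    return (2 * M) / (2 * (M + G - 1))

# Naive pipelining is the M = 1 case: fraction = 1/G.
print(active_fraction(4, 1))   # 0.25
print(active_fraction(4, 4))   # 4/7, the 4-stage x 4-microbatch example
print(active_fraction(4, 64))  # close to 1: M >> G amortizes the bubble
```

Note the G − 1 in the denominator: the bubble cost is fixed per step count, so cranking M up (GPipe's recommendation) drives the fraction toward 1 at the price of smaller per-microbatch kernels.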

Source

CS231n 2025 Lec 11 slides ~127–133 (PP bubble diagram, microbatch scheduling, active-time formula, 4-stage × 4-microbatch 57.1% example). 2026 PDF not published — using 2025 fallback.