Model Parallel

Pipeline Parallelism

Using pipelining across multiple GPUs for training.

https://colossalai.org/docs/concepts/paradigms_of_parallelism/

GPipe paper: Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," NeurIPS 2019.

CS231n 2025 Lec 11 — bubble problem + microbatches

Split the layers across G GPUs so that GPU i holds a contiguous block of layers (stage i). Forward: activations flow GPU 0 → 1 → … → G − 1. Backward: gradients flow back in reverse. Bubble problem: at any instant only one GPU is active; the rest wait on their upstream or downstream neighbor. Max MFU = 1/G.
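A minimal sketch of the naive schedule above (names and layout are illustrative, not from the lecture): one batch walks forward through G stages and back, so at every timestep exactly one GPU is busy.

```python
# Toy timeline for naive model parallelism: one batch, G stages,
# forward then backward. At every step exactly one GPU works.
G = 4
timeline = []  # (step, active_gpu, phase)
order = list(range(G)) + list(range(G - 1, -1, -1))  # fwd 0..G-1, bwd G-1..0
for step, gpu in enumerate(order):
    phase = "fwd" if step < G else "bwd"
    timeline.append((step, gpu, phase))

for step, gpu, phase in timeline:
    row = ["."] * G
    row[gpu] = "F" if phase == "fwd" else "B"
    print(f"t={step}: " + " ".join(row))

# Each GPU is busy for 2 of the 2*G steps -> utilization 2/(2*G) = 1/G (25% here).
```

Printing the rows reproduces the bubble diagram from the slides: a diagonal of F's down, a diagonal of B's back up, idle dots everywhere else.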

Microbatch fix (Huang et al., GPipe, NeurIPS 2019): split the batch into M microbatches and pipeline them. As soon as microbatch 1 finishes on GPU 0, start microbatch 2 there. The forward pass fills the pipeline; the backward pass drains it. With G stages and M microbatches, active-time fraction = 2M / (2(M + G − 1)) = M / (M + G − 1).

Worked example from the lecture: 4-way PP with 4 microbatches → 4 / (4 + 4 − 1) = 4/7 ≈ 57.1% active time. The bubble at each end is unavoidable; you amortize it by making M ≫ G.
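The active-time formula and the worked example can be checked with a few lines (a sketch; the function name is mine):

```python
def active_fraction(G, M):
    # Each stage runs M forward + M backward microbatch slots (2*M active),
    # while the pipeline needs (M + G - 1) slots per direction to fill/drain.
    return (2 * M) / (2 * (M + G - 1))

# Naive pipelining is the M = 1 case: fraction = 1/G.
print(active_fraction(4, 1))   # 0.25
print(active_fraction(4, 4))   # 4/7, the 4-stage x 4-microbatch example
print(active_fraction(4, 64))  # close to 1: M >> G amortizes the bubble
```

Note the G − 1 in the denominator: the bubble cost is fixed per step count, so cranking M up (GPipe's recommendation) drives the fraction toward 1 at the price of smaller per-microbatch kernels.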

Source

CS231n 2025 Lec 11 slides ~127–133 (PP bubble diagram, microbatch scheduling, active-time formula, 4-stage × 4-microbatch 57.1% example). 2026 PDF not published — using 2025 fallback.