Activation Checkpointing
Activation checkpointing trades compute for memory: intermediate activations are recomputed during the backward pass instead of being stored from the forward pass.
Why?
A standard forward pass caches every activation (each layer’s output) so backward can use them to compute gradients. For a deep model this activation tape dominates GPU memory: it scales as layers × batch × seq × hidden × bytes per element, times the number of tensors cached per layer. The lecture’s Llama3-405B example (bf16, batch 1, seq 4096) runs to roughly 63GB of activations. Activation checkpointing lets you keep only a few activations, recomputing the rest when needed, trading a larger backward-pass compute budget for smaller peak memory.
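The back-of-envelope arithmetic fits in a few lines. The model dimensions below are Llama3-405B’s published shapes; the per-layer cached-tensor count is an assumption for illustration, not a number from the slides:

```python
# Hedged sketch of activation-tape arithmetic. LAYERS/D_MODEL are
# Llama3-405B's published shapes; tensors_per_layer is an assumed
# count of cached (seq x d_model)-sized activations per block.
LAYERS = 126
D_MODEL = 16_384
SEQ = 4096
BATCH = 1
BYTES_BF16 = 2
tensors_per_layer = 4  # assumption, for illustration only

per_layer = BATCH * SEQ * D_MODEL * BYTES_BF16 * tensors_per_layer
total_gib = LAYERS * per_layer / 2**30
print(f"{total_gib:.0f} GiB of activations")  # 63 GiB under these assumptions
```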
The compute/memory Pareto (CS231n 2025 Lec 11)
For a linear chain of layers, three schemes sit on the tradeoff curve:
| Scheme | Forward compute | Backward compute | Peak act memory |
|---|---|---|---|
| Standard (cache everything) | O(N) | O(N) | O(N) |
| Full recompute (no cache) | O(N) | O(N²) | O(1) |
| K checkpoints | O(N) | O(N) | O(K + N/K) |
| √N checkpoints (optimal) | O(N) | O(N) | O(√N) |
Full recompute keeps only the input. To backprop through layer i you first re-run the forward from the input through layer i, which is O(i) flops; summed over i = 1…N that is N(N+1)/2 → O(N²) backward compute.
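The sum can be sanity-checked in one line (units: one layer-forward = 1 flop-unit):

```python
# Recompute cost of full recompute over an n-layer chain:
# backprop of layer i re-runs forward through layers 1..i -> cost i.
def full_recompute_cost(n: int) -> int:
    return sum(range(1, n + 1))  # = n(n+1)/2, i.e. O(n^2)

print(full_recompute_cost(100))  # 5050
```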
K checkpoints divides the N layers into K chunks of size N/K. Only chunk boundaries are cached. To backprop through any chunk, re-run forward from the chunk’s starting checkpoint (cost N/K) then backprop it (cost N/K). Summed over the K chunks → O(N) backward compute (one extra full forward in total), O(K + N/K) memory (K boundary checkpoints plus one in-flight chunk).
Setting K = √N minimizes the sum K + N/K: O(N) compute and O(√N) memory. This is the classical sublinear-memory result: the curve bottoms out at O(√N).
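A small cost model makes the optimum concrete (units: one layer-forward = 1 compute unit, one cached activation = 1 memory unit):

```python
import math

# Cost model for K checkpoints over a chain of n layers.
def backward_compute(n: float, k: float) -> float:
    # each of the k chunks re-runs its forward (n/k) then backprops it (n/k)
    return k * (n / k + n / k)  # = 2n: O(n), one extra full forward overall

def peak_memory(n: float, k: float) -> float:
    # k boundary checkpoints + one chunk's activations while backpropping it
    return k + n / k

n = 1024
best_k = min(range(1, n + 1), key=lambda k: peak_memory(n, k))
print(best_k, math.isqrt(n), peak_memory(n, best_k))  # 32 32 64.0
```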
In practice
- Checkpoints live at Transformer block granularity — each block’s input is cached, the 6 matmuls + attention matrix inside are recomputed in backward.
- Hurts MFU somewhat because backward now does extra matmuls that don’t count as “useful” FLOPs, but lets you fit a much larger batch or model — usually a net throughput win.
- Essential combined with FSDP at scale: FSDP shards params, grads, and optimizer state across GPUs; checkpointing shrinks the activation tape, which FSDP cannot shard.
- PyTorch exposes `torch.utils.checkpoint.checkpoint` as a wrapper.
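A dependency-free sketch of the mechanism that `torch.utils.checkpoint.checkpoint` automates, on a toy chain of scalar tanh “layers” (all names below are illustrative, not PyTorch API):

```python
import math

# One toy "layer" and its local derivative.
fwd = lambda x: math.tanh(x)
dfwd = lambda x: 1.0 - math.tanh(x) ** 2

def grad_standard(x, n):
    acts = [x]                        # cache every layer input: O(n) memory
    for _ in range(n):
        acts.append(fwd(acts[-1]))
    g = 1.0
    for i in range(n - 1, -1, -1):
        g *= dfwd(acts[i])            # chain rule, using cached inputs
    return g

def grad_checkpointed(x, n, k):
    chunk = n // k                    # assumes k divides n, for simplicity
    ckpts = [x]                       # cache only chunk boundaries: O(k)
    for _ in range(k):
        h = ckpts[-1]
        for _ in range(chunk):
            h = fwd(h)
        ckpts.append(h)
    g = 1.0
    for c in range(k - 1, -1, -1):    # backward, chunk by chunk
        acts = [ckpts[c]]
        for _ in range(chunk):        # recompute this chunk's activations
            acts.append(fwd(acts[-1]))
        for i in range(chunk - 1, -1, -1):
            g *= dfwd(acts[i])
    return g

# Same gradient either way, but the checkpointed version never holds
# more than k boundaries + one chunk of activations at once.
print(grad_standard(0.5, 16), grad_checkpointed(0.5, 16, 4))
```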
Source
CS231n 2025 Lec 11 slides ~71–102 (forward/backward cache diagram, full-recompute analysis, K-checkpoint derivation, K = √N optimum, why it matters next to FSDP). 2026 PDF not published; using 2025 fallback.