Model FLOPs Utilization (MFU)

The ratio of the theoretical matmul FLOPs required to train a model to the FLOPs the hardware could have delivered at peak over the same wall-clock time. Tells you how much of a GPU cluster’s peak throughput is doing useful training work.

Why?

Peak GPU FLOPs assume idealized matmuls running flat-out on tensor cores. Real training is bottlenecked by memory bandwidth, communication across GPUs, and helper compute (attention softmax, norms, optimizer steps). MFU is the industry-standard way to judge whether a distributed-training recipe is close to the hardware ceiling — “how well am I using the silicon?” It is the single headline efficiency number reported in the Llama, Gopher, PaLM, and MT-NLG papers.

Definition (CS231n 2025 Lec 11)

Numerator counts only the forward+backward matmuls of the model as defined. The measured time includes everything — communication, overlap stalls, attention’s non-matmul ops, optimizer step, activation checkpointing recompute.

HFU (Hardware FLOPs Utilization) is a looser variant that also counts extra recompute and helper compute in the numerator — so HFU ≥ MFU always. A single large matmul on an H100 hits ~80% HFU; that’s the ceiling for any subroutine.
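The definition can be sketched numerically with the common 6ND rule of thumb for transformer training FLOPs (the 6ND approximation and the toy numbers below are assumptions, not from the lecture):

```python
def mfu(n_params, n_tokens, wall_clock_s, n_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: theoretical model matmul FLOPs divided by
    the FLOPs the cluster could have delivered at peak in the same time."""
    # 6ND rule of thumb: ~6 FLOPs per parameter per training token for a
    # transformer's forward+backward matmuls (2 fwd + 4 bwd).
    model_flops = 6 * n_params * n_tokens
    peak_flops = wall_clock_s * n_gpus * peak_flops_per_gpu
    return model_flops / peak_flops

# Hypothetical run: 1B params, 1B tokens, 1,000 s on 8 GPUs at 1 PFLOP/s each
print(f"MFU: {mfu(1e9, 1e9, 1000, 8, 1e15):.0%}")  # 6e18 / 8e18 = 75%
```

Note the denominator uses measured wall-clock time, so everything in the “everything” list above (communication, stalls, recompute) shows up as lost MFU.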

Thresholds: >30% MFU is good, >40% is excellent at scale.

Published numbers

| Model  | Params | MFU   |
|--------|--------|-------|
| GPT-3  | 175B   | 21.3% |
| MT-NLG | 530B   | 30.2% |
| Gopher | 280B   | 32.5% |
| PaLM   | 540B   | 46.2% |

Note: newer devices sometimes get lower MFU. A100→H100 is 3.1× peak FLOPs but only 2.1× memory bandwidth, so memory-bound sublayers (norms, softmax, optimizer) don’t scale, pulling MFU down even though wall-clock throughput goes up.
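The effect can be sketched with a toy Amdahl-style calculation, taking the 3.1×/2.1× ratios from the note at face value and assuming, purely for illustration, that 20% of an A100 step is memory-bound:

```python
# Toy Amdahl-style model: split an A100 step into a matmul-bound part
# (speeds up by the 3.1x peak-FLOPs ratio) and a memory-bound part
# (speeds up only by the 2.1x bandwidth ratio). The 20% memory-bound
# fraction is a hypothetical illustration, not a measured number.
flops_ratio, bw_ratio = 3.1, 2.1
mem_frac = 0.20  # assumed memory-bound fraction of A100 step time

h100_time = (1 - mem_frac) / flops_ratio + mem_frac / bw_ratio
speedup = 1.0 / h100_time            # wall-clock throughput still improves
mfu_ratio = speedup / flops_ratio    # MFU is relative to each chip's own peak

print(f"wall-clock speedup: {speedup:.2f}x, H100 MFU / A100 MFU: {mfu_ratio:.2f}")
```

The speedup lands below the 3.1× FLOPs ratio, so the same workload scores a lower MFU on the H100 even though it trains faster.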

ND Parallelism — tune the recipe

The Llama3-405B run used three different 4D parallelism recipes to maximize MFU at different scales:

| GPUs   | TP | CP | PP | DP  | Seq     | Batch/DP | Tok/batch | TFLOPs/GPU | BF16 MFU |
|--------|----|----|----|-----|---------|----------|-----------|------------|----------|
| 8,192  | 8  | 1  | 16 | 64  | 8,192   | 32       | 16M       | 430        | 43%      |
| 16,384 | 8  | 1  | 16 | 128 | 8,192   | 16       | 16M       | 400        | 41%      |
| 16,384 | 8  | 16 | 16 | 8   | 131,072 | 16       | 16M       | 380        | 38%      |

Source: Llama Team, “The Llama3 Herd of Models”, arXiv 2024.

Read the table as: GPUs are arranged in a 4D grid, and each GPU’s position along the four axes gives its rank in each parallelism dimension. Doubling the cluster drops MFU from 43% to 41% because intra-node TP is already saturated and the marginal GPUs go into slower DP. Going to 131K context further drops MFU to 38% because CP replaces FSDP’s fast all-reduce with ring attention’s rolling K/V exchange.
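The MFU column can be cross-checked against the TFLOPs/GPU column, assuming the H100 SXM dense BF16 peak of 989 TFLOP/s (a spec-sheet figure, not stated in the source); the ratios agree with the table to within rounding:

```python
H100_BF16_PEAK_TFLOPS = 989  # assumed dense BF16 peak for H100 SXM

for achieved, reported in [(430, 0.43), (400, 0.41), (380, 0.38)]:
    mfu = achieved / H100_BF16_PEAK_TFLOPS
    print(f"{achieved} TFLOP/s -> {mfu:.1%} (table: {reported:.0%})")
```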

Scaling recipe (rules of thumb from the lecture)

| Cluster scale   | Model scale              | Use                                 |
|-----------------|--------------------------|-------------------------------------|
| Up to ~128 GPUs | ≤ ~1B params             | DP only                             |
| —               | > 1B params              | + FSDP                              |
| Large batch/seq | —                        | + [[notes/Activation Checkpointing]] |
| > 256 GPUs      | —                        | + HSDP                              |
| > 1K GPUs       | > 50B params or seq > 16K | + CP / PP / TP                     |

The parallelism dimensions map onto the tensor axes:

  • DP splits Batch
  • Context Parallel (CP) splits Sequence — see Attention
  • PP splits Layers
  • TP splits Dim
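Positions on the 4D grid can be sketched as a flat-rank-to-coordinates conversion. The TP-fastest axis ordering below is an assumption for illustration; real launchers may order the axes differently:

```python
def grid_coords(rank, tp, cp, pp, dp):
    """Map a flat GPU rank to (dp, pp, cp, tp) coordinates on the 4D grid.
    Assumes TP varies fastest (keeping TP peers within a node), then CP,
    then PP, then DP -- an illustrative ordering, not Llama3's actual one."""
    assert 0 <= rank < tp * cp * pp * dp
    tp_rank = rank % tp
    cp_rank = (rank // tp) % cp
    pp_rank = (rank // (tp * cp)) % pp
    dp_rank = rank // (tp * cp * pp)
    return dp_rank, pp_rank, cp_rank, tp_rank

# First Llama3 recipe: TP=8, CP=1, PP=16, DP=64 -> 8,192 GPUs total
print(grid_coords(0, 8, 1, 16, 64))     # first GPU: (0, 0, 0, 0)
print(grid_coords(8191, 8, 1, 16, 64))  # last GPU:  (63, 15, 0, 7)
```

Each coordinate picks which slice of Batch, Layers, Sequence, and Dim that GPU owns.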

Source

CS231n 2025 Lec 11 slides ~103–147 (MFU/HFU definitions, published model table, ND parallelism grid diagram, Llama3-405B three-recipe table, scaling thresholds, summary slide). 2026 PDF not published — using 2025 fallback.