Model FLOPs Utilization (MFU)

The ratio of the theoretical matmul FLOPs required to train a model to the FLOPs the hardware could have delivered at peak over the same wall-clock time. Tells you how much of a GPU cluster’s peak throughput is doing useful training work.

Why?

Peak GPU FLOPs assume idealized matmuls running flat-out on tensor cores. Real training is bottlenecked by memory bandwidth, communication across GPUs, and helper compute (attention softmax, norms, optimizer steps). MFU is the industry-standard way to judge whether a distributed-training recipe is close to the hardware ceiling — “how well am I using the silicon?” It is the single headline efficiency number reported in the Llama, Gopher, PaLM, and MT-NLG papers.

Definition (CS231n 2025 Lec 11)

Numerator counts only the forward+backward matmuls of the model as defined. The measured time includes everything — communication, overlap stalls, attention’s non-matmul ops, optimizer step, activation checkpointing recompute.

HFU (Hardware FLOPs Utilization) is a looser variant that also counts extra recompute and helper compute in the numerator — so HFU ≥ MFU always. A single large matmul on an H100 hits ~80% HFU; that’s the ceiling for any subroutine.
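The definition can be sketched numerically with the common 6ND rule of thumb for transformer training FLOPs (the 6ND approximation and the toy numbers below are assumptions, not from the lecture):

```python
def mfu(n_params, n_tokens, wall_clock_s, n_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: theoretical model matmul FLOPs divided by
    the FLOPs the cluster could have delivered at peak in the same time."""
    # 6ND rule of thumb: ~6 FLOPs per parameter per training token for a
    # transformer's forward+backward matmuls (2 fwd + 4 bwd).
    model_flops = 6 * n_params * n_tokens
    peak_flops = wall_clock_s * n_gpus * peak_flops_per_gpu
    return model_flops / peak_flops

# Hypothetical run: 1B params, 1B tokens, 1,000 s on 8 GPUs at 1 PFLOP/s each
print(f"MFU: {mfu(1e9, 1e9, 1000, 8, 1e15):.0%}")  # 6e18 / 8e18 = 75%
```

Note the denominator uses measured wall-clock time, so everything in the “everything” list above (communication, stalls, recompute) shows up as lost MFU.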

Thresholds: >30% MFU is good, >40% is excellent at scale.

Published numbers

| Model  | Params | MFU   |
|--------|--------|-------|
| GPT-3  | 175B   | 21.3% |
| MT-NLG | 530B   | 30.2% |
| Gopher | 280B   | 32.5% |
| PaLM   | 540B   | 46.2% |

Note: newer devices sometimes get lower MFU. A100→H100 is 3.1× peak FLOPs but only 2.1× memory bandwidth, so memory-bound sublayers (norms, softmax, optimizer) don’t scale, pulling MFU down even though wall-clock throughput goes up.
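The effect can be sketched with a toy Amdahl-style calculation, taking the 3.1×/2.1× ratios from the note at face value and assuming, purely for illustration, that 20% of an A100 step is memory-bound:

```python
# Toy Amdahl-style model: split an A100 step into a matmul-bound part
# (speeds up by the 3.1x peak-FLOPs ratio) and a memory-bound part
# (speeds up only by the 2.1x bandwidth ratio). The 20% memory-bound
# fraction is a hypothetical illustration, not a measured number.
flops_ratio, bw_ratio = 3.1, 2.1
mem_frac = 0.20  # assumed memory-bound fraction of A100 step time

h100_time = (1 - mem_frac) / flops_ratio + mem_frac / bw_ratio
speedup = 1.0 / h100_time            # wall-clock throughput still improves
mfu_ratio = speedup / flops_ratio    # MFU is relative to each chip's own peak

print(f"wall-clock speedup: {speedup:.2f}x, H100 MFU / A100 MFU: {mfu_ratio:.2f}")
```

The speedup lands below the 3.1× FLOPs ratio, so the same workload scores a lower MFU on the H100 even though it trains faster.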

ND Parallelism — tune the recipe

The Llama3-405B run used three different 4D parallelism recipes to maximize MFU at different scales:

| GPUs   | TP | CP | PP | DP  | Seq     | Batch/DP | Tok/batch | TFLOPs/GPU | BF16 MFU |
|--------|----|----|----|-----|---------|----------|-----------|------------|----------|
| 8,192  | 8  | 1  | 16 | 64  | 8,192   | 32       | 16M       | 430        | 43%      |
| 16,384 | 8  | 1  | 16 | 128 | 8,192   | 16       | 16M       | 400        | 41%      |
| 16,384 | 8  | 16 | 16 | 8   | 131,072 | 16       | 16M       | 380        | 38%      |

Source: Llama Team, “The Llama3 Herd of Models”, arXiv 2024.

Read the table as: GPUs are arranged in a 4D grid, and each GPU’s position along the four axes gives its rank in each parallelism dimension. Doubling the cluster drops MFU from 43% to 41% because intra-node TP is already saturated and the marginal GPUs go into slower DP. Going to 131K context further drops MFU to 38% because CP replaces FSDP’s fast all-reduce with ring attention’s rolling K/V exchange.
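The MFU column can be cross-checked against the TFLOPs/GPU column, assuming the H100 SXM dense BF16 peak of 989 TFLOP/s (a spec-sheet figure, not stated in the source); the ratios agree with the table to within rounding:

```python
H100_BF16_PEAK_TFLOPS = 989  # assumed dense BF16 peak for H100 SXM

for achieved, reported in [(430, 0.43), (400, 0.41), (380, 0.38)]:
    mfu = achieved / H100_BF16_PEAK_TFLOPS
    print(f"{achieved} TFLOP/s -> {mfu:.1%} (table: {reported:.0%})")
```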

Scaling recipe (rules of thumb from the lecture)

| Cluster scale   | Model scale              | Use                                 |
|-----------------|--------------------------|-------------------------------------|
| Up to ~128 GPUs | ≤ ~1B params             | DP only                             |
| —               | > 1B params              | + FSDP                              |
| Large batch/seq | —                        | + [[notes/Activation Checkpointing]] |
| > 256 GPUs      | —                        | + HSDP                              |
| > 1K GPUs       | > 50B params or seq > 16K | + CP / PP / TP                     |

The parallelism dimensions map onto the tensor axes:

  • DP splits Batch
  • Context Parallel (CP) splits Sequence — see Attention
  • PP splits Layers
  • TP splits Dim
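Positions on the 4D grid can be sketched as a flat-rank-to-coordinates conversion. The TP-fastest axis ordering below is an assumption for illustration; real launchers may order the axes differently:

```python
def grid_coords(rank, tp, cp, pp, dp):
    """Map a flat GPU rank to (dp, pp, cp, tp) coordinates on the 4D grid.
    Assumes TP varies fastest (keeping TP peers within a node), then CP,
    then PP, then DP -- an illustrative ordering, not Llama3's actual one."""
    assert 0 <= rank < tp * cp * pp * dp
    tp_rank = rank % tp
    cp_rank = (rank // tp) % cp
    pp_rank = (rank // (tp * cp)) % pp
    dp_rank = rank // (tp * cp * pp)
    return dp_rank, pp_rank, cp_rank, tp_rank

# First Llama3 recipe: TP=8, CP=1, PP=16, DP=64 -> 8,192 GPUs total
print(grid_coords(0, 8, 1, 16, 64))     # first GPU: (0, 0, 0, 0)
print(grid_coords(8191, 8, 1, 16, 64))  # last GPU:  (63, 15, 0, 7)
```

Each coordinate picks which slice of Batch, Layers, Sequence, and Dim that GPU owns.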

Source

CS231n 2025 Lec 11 slides ~103–147 (MFU/HFU definitions, published model table, ND parallelism grid diagram, Llama3-405B three-recipe table, scaling thresholds, summary slide). 2026 PDF not published — using 2025 fallback.