LLM Optimization

LLM optimization is the grab-bag of techniques that make large language model training and inference run faster or cheaper. Built on top of GPU Optimization, every technique below attacks one specific bottleneck.

Why so many knobs?

Different workloads hit different bottlenecks (weight bandwidth, KV memory, attention traffic, sequential latency, batching). Each technique trades one resource against another, so the right mix depends on which bottleneck you’re actually sitting on. Profile first.

Guiding fact

LLM inference is memory-bandwidth-bound, not compute-bound: weights stream from HBM every single token. That single fact drives most of the inference techniques below (shrink weights, reuse KV, batch more requests per HBM sweep).
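A back-of-envelope roofline makes the point concrete. The numbers below are illustrative assumptions (a 7B-parameter FP16 model on a GPU with ~2 TB/s of HBM bandwidth and ~300 TFLOP/s of FP16 compute), not measurements:

```python
# Back-of-envelope check that single-stream decode is bandwidth-bound.
params = 7e9
bytes_per_param = 2                      # FP16 weights
weight_bytes = params * bytes_per_param  # ~14 GB streamed per token

hbm_bw = 2e12        # bytes/s (assumed)
fp16_flops = 3e14    # FLOP/s (assumed)

# Each decoded token does ~2 FLOPs per parameter (multiply + add).
t_compute = (2 * params) / fp16_flops    # seconds of math per token
t_memory = weight_bytes / hbm_bw         # seconds just to stream the weights

print(f"compute-bound floor:   {t_compute * 1e3:.2f} ms/token")
print(f"bandwidth-bound floor: {t_memory * 1e3:.2f} ms/token")
# Streaming weights takes >100x longer than the math at batch 1, which is
# why batching more requests per HBM sweep (or shrinking weights) helps.
```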

Inference bottlenecks

  1. Weight bandwidth: every token re-reads all weights from HBM
  2. KV memory: past K/V grows linearly with sequence length and fragments under batching
  3. Attention HBM traffic: the NĂ—N attention matrix materializes in HBM
    • FlashAttention (tile attention into SRAM, never materialize NĂ—N)
  4. Sequential decode latency: one token at a time, no parallelism across positions
  5. GPU idle between requests: naive batching stalls when one request finishes early
  6. Model too big for one GPU: >80 GB needs sharding
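The FlashAttention entry above hinges on the online softmax: keep a running max and normalizer so each key/value tile can be folded in without ever holding the full NĂ—N score matrix. A minimal single-head NumPy sketch of the idea (not the real fused kernel):

```python
import numpy as np

def attention_tiled(q, K, V, tile=4):
    """softmax(q @ K^T / sqrt(d)) @ V, one K/V tile at a time."""
    d = q.shape[-1]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax normalizer
    acc = np.zeros(d)    # running weighted sum of V rows
    for start in range(0, K.shape[0], tile):
        Kb, Vb = K[start:start + tile], V[start:start + tile]
        s = Kb @ q / np.sqrt(d)          # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale old state to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
q = rng.normal(size=8)

# Reference: materialize all scores at once, the thing FlashAttention avoids.
s = K @ q / np.sqrt(8)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
assert np.allclose(attention_tiled(q, K, V), ref)
```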

The big serving systems (vLLM, TensorRT-LLM, SGLang) stack most of these.
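A toy step-count model shows why they all do continuous batching for bottleneck 5. Request lengths here are made up; one "step" decodes one token for every active slot:

```python
import heapq

def static_steps(lengths, batch):
    """Static batching: the whole batch waits for its longest request."""
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])
    return steps

def continuous_steps(lengths, batch):
    """Continuous batching: a new request is admitted as soon as a slot frees."""
    slots = [0] * batch                  # step at which each slot becomes free
    for n in lengths:
        start = heapq.heappop(slots)     # grab the earliest-free slot
        heapq.heappush(slots, start + n)
    return max(slots)

reqs = [100, 10, 10, 10, 100, 10, 10, 10]   # hypothetical output lengths
print(static_steps(reqs, 4))       # -> 200: short requests stall behind long ones
print(continuous_steps(reqs, 4))   # -> 110: freed slots are refilled immediately
```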

Training bottlenecks

Training is usually GPU-memory-bound (activations + optimizer state + gradients all live on GPU at once), not compute-bound. Hugging Face’s single-GPU training guide [Fac23b,c] breaks transformer ops into: tensor contractions (matmul, compute-heavy), statistical normalizations (map + reduce), element-wise ops (dropout, biases, cheap).
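A rough per-parameter budget shows why. Under one common mixed-precision Adam recipe (an assumption; recipes vary in where FP32 copies live), static state alone is ~16 bytes per parameter before any activations:

```python
# Static GPU memory per parameter for mixed-precision Adam training.
# Assumed recipe: FP16 working weights and gradients, plus FP32 master
# weights and two FP32 Adam moments.

def static_train_bytes(params):
    weights = 2 * params      # FP16 working copy
    grads = 2 * params        # FP16 gradients
    master = 4 * params       # FP32 master weights
    adam = 2 * 4 * params     # FP32 first + second moments
    return weights + grads + master + adam

gb = static_train_bytes(7e9) / 1e9
print(f"{gb:.0f} GB before a single activation")  # ~112 GB for 7B params
```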

  1. Activation memory
  2. Effective batch size without the memory
  3. Compute throughput
  4. Host→GPU transfer
  5. Model too big
  6. Batch size tuning: binary-search for the largest batch that doesn’t OOM
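Item 6 can be sketched as a bracketing search. `fits` is a hypothetical probe; in practice it would run one forward/backward step and catch the CUDA OOM:

```python
def largest_batch(fits, cap=4096):
    """Largest b <= cap with fits(b) True, assuming fits is monotone."""
    hi = 1
    while hi <= cap and fits(hi):        # grow until the first failure
        hi *= 2
    lo, hi = hi // 2, min(hi, cap + 1)   # now lo fits, hi does not
    if lo == 0:
        return 0                         # even batch=1 OOMs
    while hi - lo > 1:                   # binary-search the boundary
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy probe standing in for a real OOM check: pretend 8 GB of memory
# and ~0.9 GB per sample.
print(largest_batch(lambda b: b * 0.9 <= 8.0))  # -> 8
```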

Batch size on ecetesla0

bert-large-uncased (340M parameters) OOM’d at batch=1, so the run is on bert-base-uncased (110M parameters):

Batch   Time (s)   Samples/s   Mem (MB)   Util (%)
1       109.62     4.67        3281       43.1
2       85.82      5.97        3391       44.6
4       72.18      7.09        4613       60.6
8       66.70      7.68        7069       92.9

Batch 9 OOM’d. Throughput keeps climbing with GPU utilization right up to the memory wall.

Measure accuracy, not just throughput

Grad-accumulation and large effective batches can silently tank validation accuracy. See the Yelp table on Gradient Accumulation for what chasing samples/sec without a val set looks like.
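For context, gradient accumulation is exactly a bigger effective batch, not an approximation of one; the optimizer just sees fewer, smoother steps. A toy least-squares check:

```python
import numpy as np

# Averaging per-micro-batch gradients of a mean loss equals the gradient
# of one big batch. Toy model: linear fit, loss = mean((Xw - y)^2).

rng = np.random.default_rng(1)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

big = grad(X, y, w)                 # one batch of 32

acc = np.zeros_like(w)              # 4 accumulated micro-batches of 8
for i in range(0, 32, 8):
    acc += grad(X[i:i + 8], y[i:i + 8], w)
acc /= 4

assert np.allclose(big, acc)        # same gradient, larger effective batch
```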

Hallucinations are yours to verify

LLMs confidently return made-up output. A lawyer famously used ChatGPT for “research” and got fake case citations. “The software said so” doesn’t absolve a civil engineer when the building falls down.