LLM Optimization
LLM optimization is the grab-bag of techniques that make large language model training and inference run faster or cheaper. Everything here builds on GPU Optimization; each technique below attacks one specific bottleneck.
Why so many knobs?
Different workloads hit different bottlenecks (weight bandwidth, KV memory, attention traffic, sequential latency, batching). Each technique trades one resource against another, so the right mix depends on which bottleneck you’re actually sitting on. Profile first.
Guiding fact
LLM inference is memory-bandwidth-bound, not compute-bound: weights stream from HBM every single token. That single fact drives most of the inference techniques below (shrink weights, reuse KV, batch more requests per HBM sweep).
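A back-of-envelope roofline makes the point concrete. The numbers below are illustrative assumptions (roughly H100-class HBM bandwidth and a 7B fp16 model), not measurements:

```python
# Back-of-envelope decode ceiling: each generated token streams every weight
# from HBM once, so tokens/s <= HBM bandwidth / weight bytes.
hbm_bandwidth_gb_s = 3350      # ~H100 SXM HBM3, GB/s (illustrative)
params_b = 7                   # 7B-parameter model
bytes_per_param = 2            # fp16/bf16

weight_gb = params_b * bytes_per_param          # 14 GB streamed per token
ceiling_tok_s = hbm_bandwidth_gb_s / weight_gb

print(f"single-stream upper bound ~ {ceiling_tok_s:.0f} tok/s")  # ~ 239
# int4 weights (0.5 bytes/param) lift the same ceiling 4x, which is why
# quantization pays off so directly on a bandwidth-bound workload.
```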
Inference bottlenecks
- Weight bandwidth: every token re-reads all weights from HBM
- KV memory: past K/V grows linearly with sequence length and fragments under batching
- Attention HBM traffic: the N×N attention matrix materializes in HBM
- Sequential decode latency: one token at a time, no parallelism across positions
- GPU idle between requests: naive batching stalls when one request finishes early
- Model too big for one GPU: >80 GB of weights needs sharding
Techniques, by bottleneck:
- Weight bandwidth:
- Quantization (fp16 → int8 / int4, 2-8× smaller weights)
- Mixed Precision (FP16/BF16 vs FP32)
- KV memory:
- Paged Attention (OS-style paging, vLLM’s memory half)
- KV Cache quantization
- Attention HBM traffic:
- FlashAttention (tile attention into SRAM, never materialize N×N)
- Sequential decode latency:
- Speculative Decoding (draft model proposes tokens, big model verifies in one pass)
- GPU idle between requests:
- Continuous Batching (iteration-level batching, vLLM’s throughput half)
- Prompt Caching (reuse KV for shared system-prompt prefix)
- Model too big:
- Tensor Parallelism (shard each layer, all-reduce per layer)
- Pipeline Parallelism (shard by layer, pipeline bubbles)
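To make the weight-bandwidth lever concrete, here is a minimal absmax int8 row-quantization sketch in plain Python. It is not any particular library's scheme, just the core idea: store int8 values plus one float scale per row, halving fp16 weight bytes and so roughly doubling the bandwidth-bound decode ceiling.

```python
# Hedged sketch: per-row absmax int8 quantization.
# Map each row's max |w| to 127; keep one fp scale per row for dequant.

def quantize_row(row):
    scale = max(abs(w) for w in row) / 127.0 or 1.0   # avoid 0-scale rows
    q = [round(w / scale) for w in row]               # ints in [-127, 127]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.50, 0.33, 0.07]
q, s = quantize_row(row)
approx = dequantize_row(q, s)

assert all(-127 <= v <= 127 for v in q)
# Round-trip error is bounded by half a quantization step per element.
assert max(abs(a - b) for a, b in zip(row, approx)) < s
```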
The big serving systems (vLLM, TensorRT-LLM, SGLang) stack most of these.
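A toy sketch of the KV-memory half (the PagedAttention idea): each sequence keeps a block table mapping logical pages to physical blocks from a shared free list, so KV memory grows in small blocks instead of one max-length slab. Assumptions: a block size of 4 tokens and a plain free list; real vLLM adds copy-on-write, preemption, and the GPU kernels.

```python
# Hedged sketch of block-table KV allocation, not vLLM's actual code.
BLOCK = 4  # tokens per KV block (assumed)

class KVPool:
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))
    def alloc(self):
        return self.free.pop()
    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, pool):
        self.pool, self.table, self.n_tokens = pool, [], 0
    def append_token(self):
        if self.n_tokens % BLOCK == 0:    # current block full (or first token)
            self.table.append(self.pool.alloc())
        self.n_tokens += 1

pool = KVPool(n_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(6):
    a.append_token()                      # 6 tokens -> 2 blocks
for _ in range(3):
    b.append_token()                      # 3 tokens -> 1 block
assert len(a.table) == 2 and len(b.table) == 1
pool.release(b.table)                     # b finishes: blocks reusable at once
assert len(pool.free) == 6                # no fragmentation, no slab hoarding
```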
Training bottlenecks
Training is usually GPU-memory-bound (activations + optimizer state + gradients all live on GPU at once), not compute-bound. Hugging Face’s single-GPU training guide [Fac23b,c] breaks transformer ops into: tensor contractions (matmul, compute-heavy), statistical normalizations (map + reduce), element-wise ops (dropout, biases, cheap).
- Activation memory:
- Gradient Checkpointing (recompute on backward, ~½ activation memory for ~20% more time)
- Effective batch size without the memory:
- Gradient Accumulation (sum grads over micro-batches, can hurt accuracy past the sweet spot)
- Compute throughput:
- Mixed Precision (FP16/BF16 matmul on Tensor Cores, 2-4Ă— throughput)
- Host→GPU transfer:
- Pinned Memory (page-locked host RAM, enables async H2D copies)
- GPU memory headroom:
- Batch size tuning (binary-search for the largest batch that doesn’t OOM)
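Gradient accumulation's core identity, that summing micro-batch gradients and normalizing once equals the full-batch gradient, can be checked on a toy 1-D least-squares model (all numbers below are made up for illustration):

```python
# Hedged sketch: gradient accumulation for L = 0.5*(w*x - y)^2 per sample.

def grad_w(w, xs, ys):
    """Sum (not mean) of per-sample dL/dw."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Full-batch gradient: mean over all 4 samples.
full = grad_w(w, xs, ys) / len(xs)

# Accumulate unsummed grads over two micro-batches of 2, normalize once.
acc = 0.0
for lo in range(0, len(xs), 2):
    acc += grad_w(w, xs[lo:lo + 2], ys[lo:lo + 2])
acc /= len(xs)

assert abs(full - acc) < 1e-9   # identical up to float rounding
```

The math is exact; the accuracy risk in the note above comes from the *large effective batch* it enables, not from the accumulation itself.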
Batch size on ecetesla0
bert-large-uncased (340M params) OOM’d at batch=1, so the run is on bert-base-uncased (110M params):
| Batch | Time (s) | Samples/s | Mem (MB) | Util (%) |
|---|---|---|---|---|
| 1 | 109.62 | 4.67 | 3281 | 43.1 |
| 2 | 85.82 | 5.97 | 3391 | 44.6 |
| 4 | 72.18 | 7.09 | 4613 | 60.6 |
| 8 | 66.70 | 7.68 | 7069 | 92.9 |
Batch 9 OOM’d. Throughput keeps climbing with GPU utilization right up to the memory wall.
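The search itself is trivial to sketch. The `fits()` probe below is hypothetical, standing in for a real training step wrapped in a CUDA-OOM try/except, and the search assumes fitting is monotone in batch size:

```python
# Hedged sketch: binary-search the largest non-OOM batch size.

def fits(batch):
    """Hypothetical probe: run one training step, catch OOM."""
    return batch <= 8            # pretend batch 9 OOMs, as on ecetesla0

def largest_batch(hi=512):
    lo, best = 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # fits: try bigger
        else:
            hi = mid - 1              # OOM: try smaller
    return best

assert largest_batch() == 8
```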
Measure accuracy, not just throughput
Grad-accumulation and large effective batches can silently tank validation accuracy. See the Yelp table on Gradient Accumulation for what chasing samples/sec without a val set looks like.
Hallucinations are yours to verify
LLMs confidently return made-up output. A lawyer famously used ChatGPT for “research” and got fake case citations. “The software said so” doesn’t absolve a civil engineer when the building falls down.