LLM Optimization
LLM optimization is the grab-bag of techniques that make large language model training and inference run faster or cheaper. Everything here builds on GPU Optimization; each technique below attacks one specific bottleneck.
Why so many knobs?
Different workloads hit different bottlenecks (weight bandwidth, KV memory, attention traffic, sequential latency, batching). Each technique trades one resource against another, so the right mix depends on which bottleneck you’re actually sitting on. Profile first.
Guiding fact
LLM inference is memory-bandwidth-bound, not compute-bound: weights stream from HBM every single token. That single fact drives most of the inference techniques below (shrink weights, reuse KV, batch more requests per HBM sweep).
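A back-of-envelope roofline makes the point concrete. The numbers below are illustrative assumptions (roughly H100-class HBM bandwidth and a 7B fp16 model), not measurements:

```python
# Back-of-envelope decode ceiling: each generated token streams every weight
# from HBM once, so tokens/s <= HBM bandwidth / weight bytes.
hbm_bandwidth_gb_s = 3350      # ~H100 SXM HBM3, GB/s (illustrative)
params_b = 7                   # 7B-parameter model
bytes_per_param = 2            # fp16/bf16

weight_gb = params_b * bytes_per_param          # 14 GB streamed per token
ceiling_tok_s = hbm_bandwidth_gb_s / weight_gb

print(f"single-stream upper bound ~ {ceiling_tok_s:.0f} tok/s")  # ~ 239
# int4 weights (0.5 bytes/param) lift the same ceiling 4x, which is why
# quantization pays off so directly on a bandwidth-bound workload.
```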
Inference bottlenecks
- Weight bandwidth: every token re-reads all weights from HBM
- KV memory: past K/V grows linearly with sequence length and fragments under batching
- Attention HBM traffic: the N×N attention matrix materializes in HBM
- Sequential decode latency: one token at a time, no parallelism across positions
- GPU idle between requests: naive batching stalls when one request finishes early
- Model too big for one GPU: >80 GB of weights needs sharding
Techniques, by bottleneck:
- Weight bandwidth:
- Quantization (fp16 → int8 / int4, 2-8× smaller weights)
- Mixed Precision (FP16/BF16 vs FP32)
- KV memory:
- Paged Attention (OS-style paging, vLLM’s memory half)
- KV Cache quantization
- Attention HBM traffic:
- FlashAttention (tile attention into SRAM, never materialize N×N)
- Sequential decode latency:
- Speculative Decoding (draft model proposes tokens, big model verifies in one pass)
- GPU idle between requests:
- Continuous Batching (iteration-level batching, vLLM’s throughput half)
- Prompt Caching (reuse KV for shared system-prompt prefix)
- Model too big:
- Tensor Parallelism (shard each layer, all-reduce per layer)
- Pipeline Parallelism (shard by layer, pipeline bubbles)
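To make the weight-bandwidth lever concrete, here is a minimal absmax int8 row-quantization sketch in plain Python. It is not any particular library's scheme, just the core idea: store int8 values plus one float scale per row, halving fp16 weight bytes and so roughly doubling the bandwidth-bound decode ceiling.

```python
# Hedged sketch: per-row absmax int8 quantization.
# Map each row's max |w| to 127; keep one fp scale per row for dequant.

def quantize_row(row):
    scale = max(abs(w) for w in row) / 127.0 or 1.0   # avoid 0-scale rows
    q = [round(w / scale) for w in row]               # ints in [-127, 127]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.50, 0.33, 0.07]
q, s = quantize_row(row)
approx = dequantize_row(q, s)

assert all(-127 <= v <= 127 for v in q)
# Round-trip error is bounded by half a quantization step per element.
assert max(abs(a - b) for a, b in zip(row, approx)) < s
```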
The big serving systems (vLLM, TensorRT-LLM, SGLang) stack most of these.
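A toy sketch of the KV-memory half (the PagedAttention idea): each sequence keeps a block table mapping logical pages to physical blocks from a shared free list, so KV memory grows in small blocks instead of one max-length slab. Assumptions: a block size of 4 tokens and a plain free list; real vLLM adds copy-on-write, preemption, and the GPU kernels.

```python
# Hedged sketch of block-table KV allocation, not vLLM's actual code.
BLOCK = 4  # tokens per KV block (assumed)

class KVPool:
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))
    def alloc(self):
        return self.free.pop()
    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, pool):
        self.pool, self.table, self.n_tokens = pool, [], 0
    def append_token(self):
        if self.n_tokens % BLOCK == 0:    # current block full (or first token)
            self.table.append(self.pool.alloc())
        self.n_tokens += 1

pool = KVPool(n_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(6):
    a.append_token()                      # 6 tokens -> 2 blocks
for _ in range(3):
    b.append_token()                      # 3 tokens -> 1 block
assert len(a.table) == 2 and len(b.table) == 1
pool.release(b.table)                     # b finishes: blocks reusable at once
assert len(pool.free) == 6                # no fragmentation, no slab hoarding
```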
Training bottlenecks
Training is usually GPU-memory-bound (activations + optimizer state + gradients all live on GPU at once), not compute-bound. Hugging Face’s single-GPU training guide [Fac23b,c] breaks transformer ops into: tensor contractions (matmul, compute-heavy), statistical normalizations (map + reduce), element-wise ops (dropout, biases, cheap).
- Activation memory:
- Gradient Checkpointing (recompute on backward, ~½ activation memory for ~20% more time)
- Effective batch size without the memory:
- Gradient Accumulation (sum grads over micro-batches, can hurt accuracy past the sweet spot)
- Compute throughput:
- Mixed Precision (FP16/BF16 matmul on Tensor Cores, 2-4Ă— throughput)
- Host→GPU transfer:
- Pinned Memory (page-locked host RAM, enables async H2D copies)
- GPU memory headroom:
- Batch size tuning (binary-search for the largest batch that doesn’t OOM)
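Gradient accumulation's core identity, that summing micro-batch gradients and normalizing once equals the full-batch gradient, can be checked on a toy 1-D least-squares model (all numbers below are made up for illustration):

```python
# Hedged sketch: gradient accumulation for L = 0.5*(w*x - y)^2 per sample.

def grad_w(w, xs, ys):
    """Sum (not mean) of per-sample dL/dw."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Full-batch gradient: mean over all 4 samples.
full = grad_w(w, xs, ys) / len(xs)

# Accumulate unsummed grads over two micro-batches of 2, normalize once.
acc = 0.0
for lo in range(0, len(xs), 2):
    acc += grad_w(w, xs[lo:lo + 2], ys[lo:lo + 2])
acc /= len(xs)

assert abs(full - acc) < 1e-9   # identical up to float rounding
```

The math is exact; the accuracy risk in the note above comes from the *large effective batch* it enables, not from the accumulation itself.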
Batch size on ecetesla0
bert-large-uncased (340M params) OOM’d at batch=1, so the run is on bert-base-uncased (110M params):
| Batch | Time (s) | Samples/s | Mem (MB) | Util (%) |
|---|---|---|---|---|
| 1 | 109.62 | 4.67 | 3281 | 43.1 |
| 2 | 85.82 | 5.97 | 3391 | 44.6 |
| 4 | 72.18 | 7.09 | 4613 | 60.6 |
| 8 | 66.70 | 7.68 | 7069 | 92.9 |
Batch 9 OOM’d. Throughput keeps climbing with GPU utilization right up to the memory wall.
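The search itself is trivial to sketch. The `fits()` probe below is hypothetical, standing in for a real training step wrapped in a CUDA-OOM try/except, and the search assumes fitting is monotone in batch size:

```python
# Hedged sketch: binary-search the largest non-OOM batch size.

def fits(batch):
    """Hypothetical probe: run one training step, catch OOM."""
    return batch <= 8            # pretend batch 9 OOMs, as on ecetesla0

def largest_batch(hi=512):
    lo, best = 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # fits: try bigger
        else:
            hi = mid - 1              # OOM: try smaller
    return best

assert largest_batch() == 8
```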
Measure accuracy, not just throughput
Grad-accumulation and large effective batches can silently tank validation accuracy. See the Yelp table on Gradient Accumulation for what chasing samples/sec without a val set looks like.
Hallucinations are yours to verify
LLMs confidently return made-up output. A lawyer famously used ChatGPT for “research” and got fake case citations. “The software said so” doesn’t absolve a civil engineer when the building falls down.