Prompt Caching
Prompt caching reuses the KV cache for a shared prefix across requests, skipping the prefill compute for tokens that were already processed.
Where does the win come from?
Most production LLM calls share a long system prompt, instructions, or few-shot examples. Computing their KV cache once per version and reusing it across every subsequent request with the same prefix eliminates redundant prefill, which is often the dominant cost for short outputs.
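The arithmetic is straightforward. A minimal sketch (illustrative numbers, not measurements) of how much prefill a cached prefix skips:

```python
# Back-of-envelope prefill savings from reusing a cached prefix.
# Token counts below are illustrative assumptions, not measured figures.
def prefill_savings(prefix_tokens: int, suffix_tokens: int) -> float:
    """Fraction of prefill tokens skipped when the prefix KV cache is reused."""
    total = prefix_tokens + suffix_tokens
    return prefix_tokens / total

# A 4,000-token shared system prompt with a 200-token user message:
saved = prefill_savings(4000, 200)
print(f"{saved:.0%} of prefill tokens skipped")  # prints "95% of prefill tokens skipped"
```

With short outputs, that skipped prefill is most of the request's compute, which is where the latency and cost win comes from.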
Mechanics
- Hash the prompt prefix, look up its KV cache
- On hit: skip attention+FFN for those tokens, pick up from the cached state
- On miss: compute as normal and store the result for future requests
- Typically scoped per-tenant or per-API-key to avoid leaking prompts
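The flow above can be sketched as a keyed lookup. This is a hypothetical toy, not any engine's real implementation: `compute_kv` stands in for the actual prefill pass, and the cached KV state is an opaque value.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: hash the prefix, reuse the KV state on a hit."""

    def __init__(self):
        self._store = {}  # (tenant, prefix_hash) -> cached KV state

    def _key(self, tenant: str, prefix_tokens: list[int]):
        h = hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()
        return (tenant, h)  # scoped per-tenant to avoid cross-tenant leaks

    def get_or_compute(self, tenant, prefix_tokens, compute_kv):
        key = self._key(tenant, prefix_tokens)
        if key not in self._store:
            # Miss: run prefill as normal and store the result.
            self._store[key] = compute_kv(prefix_tokens)
        # Hit (or freshly stored): pick up from the cached state.
        return self._store[key]
```

A second request with the same tenant and prefix returns the stored state without calling `compute_kv` again.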
Where it’s deployed
- Anthropic: explicit cache-control breakpoints, 90% discount on cache reads
- OpenAI: automatic for prefixes ≥ 1024 tokens, 50% discount
- vLLM: --enable-prefix-caching, works at block granularity on top of PagedAttention
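As a concrete example of the explicit-breakpoint style, an Anthropic Messages API request marks the cacheable prefix with a cache_control annotation. This is a sketch of the payload shape based on the public docs; the model name is illustrative and exact field details should be checked against the API reference:

```python
# Sketch of an Anthropic-style request body: the long, shared system
# prompt carries a cache_control breakpoint so its KV cache is reused.
request = {
    "model": "claude-sonnet-4",  # illustrative model name
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": "LONG_SYSTEM_PROMPT",            # shared across requests
            "cache_control": {"type": "ephemeral"},  # cache breakpoint here
        }
    ],
    "messages": [{"role": "user", "content": "user-specific question"}],
}
```

Everything up to and including the breakpoint is eligible for caching; the per-request user message after it is prefilled as normal.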
Trade-offs
- Cache lifetime: Anthropic’s is 5 minutes (or 1 hour with extended TTL), idle caches get evicted
- Cache invalidation: any change to the prefix busts the cache for everything after it, so prefixes should be stable
- Memory pressure: keeping caches around competes with active request KV
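The lifetime trade-off can be illustrated with a toy TTL cache. This is a sketch, not a provider's actual eviction policy; the 300-second default mirrors the 5-minute window mentioned above, and the refresh-on-access behavior is an assumption:

```python
import time

class TTLCache:
    """Toy TTL cache: idle entries expire; access refreshes the clock."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, last_access_time)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None or now - entry[1] > self.ttl:
            self._store.pop(key, None)  # expired or absent: evict
            return None
        self._store[key] = (entry[0], now)  # refresh TTL on access
        return entry[0]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now)
```

Requests arriving within the TTL keep the entry alive; a gap longer than the TTL means the next request misses and pays full prefill again.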