Prompt Caching
Prompt caching reuses the KV cache for a shared prefix across requests, skipping the prefill compute for tokens that were already processed.
Where does the win come from?
Most production LLM calls share a long system prompt, instructions, or few-shot examples. Computing their KV cache once per version and reusing it across every subsequent request with the same prefix eliminates redundant prefill, which is often the dominant cost for short outputs.
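The arithmetic is straightforward. A minimal sketch (illustrative numbers, not measurements) of how much prefill a cached prefix skips:

```python
# Back-of-envelope prefill savings from reusing a cached prefix.
# Token counts below are illustrative assumptions, not measured figures.
def prefill_savings(prefix_tokens: int, suffix_tokens: int) -> float:
    """Fraction of prefill tokens skipped when the prefix KV cache is reused."""
    total = prefix_tokens + suffix_tokens
    return prefix_tokens / total

# A 4,000-token shared system prompt with a 200-token user message:
saved = prefill_savings(4000, 200)
print(f"{saved:.0%} of prefill tokens skipped")  # prints "95% of prefill tokens skipped"
```

With short outputs, that skipped prefill is most of the request's compute, which is where the latency and cost win comes from.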
Mechanics
- Hash the prompt prefix, look up its KV cache
- On hit: skip attention+FFN for those tokens, pick up from the cached state
- On miss: compute as normal and store the result for future requests
- Typically scoped per-tenant or per-API-key to avoid leaking prompts
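The flow above can be sketched as a keyed lookup. This is a hypothetical toy, not any engine's real implementation: `compute_kv` stands in for the actual prefill pass, and the cached KV state is an opaque value.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: hash the prefix, reuse the KV state on a hit."""

    def __init__(self):
        self._store = {}  # (tenant, prefix_hash) -> cached KV state

    def _key(self, tenant: str, prefix_tokens: list[int]):
        h = hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()
        return (tenant, h)  # scoped per-tenant to avoid cross-tenant leaks

    def get_or_compute(self, tenant, prefix_tokens, compute_kv):
        key = self._key(tenant, prefix_tokens)
        if key not in self._store:
            # Miss: run prefill as normal and store the result.
            self._store[key] = compute_kv(prefix_tokens)
        # Hit (or freshly stored): pick up from the cached state.
        return self._store[key]
```

A second request with the same tenant and prefix returns the stored state without calling `compute_kv` again.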
Where it’s deployed
- Anthropic: explicit cache-control breakpoints, 90% discount on cache reads
- OpenAI: automatic for prefixes ≥ 1024 tokens, 50% discount
- vLLM: --enable-prefix-caching, works at block granularity on top of PagedAttention
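As a concrete example of the explicit-breakpoint style, an Anthropic Messages API request marks the cacheable prefix with a cache_control annotation. This is a sketch of the payload shape based on the public docs; the model name is illustrative and exact field details should be checked against the API reference:

```python
# Sketch of an Anthropic-style request body: the long, shared system
# prompt carries a cache_control breakpoint so its KV cache is reused.
request = {
    "model": "claude-sonnet-4",  # illustrative model name
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": "LONG_SYSTEM_PROMPT",            # shared across requests
            "cache_control": {"type": "ephemeral"},  # cache breakpoint here
        }
    ],
    "messages": [{"role": "user", "content": "user-specific question"}],
}
```

Everything up to and including the breakpoint is eligible for caching; the per-request user message after it is prefilled as normal.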
Trade-offs
- Cache lifetime: Anthropic’s is 5 minutes (or 1 hour with extended TTL), idle caches get evicted
- Cache invalidation: any change to the prefix busts the cache for everything after it, so prefixes should be stable
- Memory pressure: keeping caches around competes with active request KV
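The lifetime trade-off can be illustrated with a toy TTL cache. This is a sketch, not a provider's actual eviction policy; the 300-second default mirrors the 5-minute window mentioned above, and the refresh-on-access behavior is an assumption:

```python
import time

class TTLCache:
    """Toy TTL cache: idle entries expire; access refreshes the clock."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, last_access_time)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None or now - entry[1] > self.ttl:
            self._store.pop(key, None)  # expired or absent: evict
            return None
        self._store[key] = (entry[0], now)  # refresh TTL on access
        return entry[0]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now)
```

Requests arriving within the TTL keep the entry alive; a gap longer than the TTL means the next request misses and pays full prefill again.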