Paged Attention
Paged attention manages the KV cache in fixed-size blocks accessed through a page table, borrowing the idea from OS virtual memory (Kwon et al., 2023). It is the main contribution of vLLM, alongside continuous batching.
What's wrong with contiguous KV?
Traditional inference pre-allocates a contiguous KV buffer per request, sized to the worst-case sequence length. Short requests waste the unused tail, requests that outgrow their buffer must be copied to a larger one, and mixed-length batches fragment GPU memory badly.
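To make the tail waste concrete, here is a back-of-the-envelope calculation in Python. The model dimensions and lengths are illustrative assumptions, not measurements from any particular system:

```python
# Illustrative only: assumed model dimensions, not measured from a real system.
num_layers = 32          # transformer layers
num_kv_heads = 32        # KV heads per layer
head_dim = 128           # dimension per head
bytes_per_elem = 2       # fp16

# KV bytes per token = 2 (K and V) * layers * heads * head_dim * bytes
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 512 KiB

max_len = 4096           # worst-case length the server pre-allocates for
actual_len = 300         # tokens the request actually produced

reserved = max_len * kv_bytes_per_token / 2**30     # GiB held for the request
used = actual_len * kv_bytes_per_token / 2**30      # GiB actually written
print(f"reserved {reserved:.2f} GiB, used {used:.2f} GiB, "
      f"wasted {100 * (1 - actual_len / max_len):.0f}%")
# reserved 2.00 GiB, used 0.15 GiB, wasted 93%
```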
How paging fixes it
- Split each sequence’s KV cache into fixed-size blocks (e.g. 16 tokens per block)
- Each sequence has a page table mapping logical positions to physical blocks (see the sketch after this list)
- Blocks live anywhere in GPU memory, not contiguously
- Blocks are allocated on demand as sequences grow
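A minimal sketch of that bookkeeping in plain Python. `BlockAllocator`, `Sequence`, and `block_table` are names invented here for illustration; vLLM's real allocator manages actual GPU memory, swapping, and preemption, all of which this omits:

```python
BLOCK_SIZE = 16  # tokens per block, matching the example above

class BlockAllocator:
    """Hands out fixed-size physical block IDs from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks; preempt or swap a sequence")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Per-sequence page table: logical block index -> physical block ID."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so a sequence never reserves more than one partially used block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def slot(self, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # three physical block IDs, possibly scattered
print(seq.slot(39))          # (block_table[2], 7)
```

Note the on-demand allocation: a sequence's over-reservation is bounded by one partially filled block, regardless of how long it grows.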
Outcome: near-zero fragmentation, much larger effective batch sizes, and prefix sharing across sequences (copy-on-write for branching; direct sharing for identical prefixes, which powers prompt caching).
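Sharing falls out of the same structure once blocks carry reference counts. The sketch below extends the `BlockAllocator` and `Sequence` classes from above; the refcount field and `fork` helper are our illustration, not vLLM's API:

```python
class RefCountedAllocator(BlockAllocator):
    """Adds reference counts so multiple sequences can share blocks."""
    def __init__(self, num_blocks: int):
        super().__init__(num_blocks)
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block_id = super().alloc()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        self.refcount[block_id] += 1

    def release(self, block_id: int) -> None:
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            super().release(block_id)  # back to the free pool

def fork(parent: Sequence) -> Sequence:
    """Branch a sequence: the child initially shares every parent block."""
    child = Sequence(parent.allocator)
    child.block_table = list(parent.block_table)
    child.num_tokens = parent.num_tokens
    for block_id in child.block_table:
        parent.allocator.share(block_id)
    return child

# Copy-on-write: before writing to a block with refcount > 1, allocate a
# fresh block, copy its contents, and update only the writer's page table
# entry; the sibling keeps reading the original block.
```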
Why it enables continuous batching
Continuous batching wants to mix requests of wildly different lengths in one forward pass. Contiguous KV can’t handle that without copying. Page-based KV just hands each request a different set of physical blocks and uses the page table to stitch them together during attention.
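A NumPy sketch of that stitching for a single attention head. A real kernel (vLLM's PagedAttention CUDA kernel, for instance) fuses the gather into the attention computation instead of materializing `k` and `v`; the names, shapes, and block IDs here are assumptions for illustration:

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64
NUM_BLOCKS = 256

# One global pool of KV blocks shared by every request in the batch.
k_cache = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)
v_cache = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)

def paged_attention(q, block_table, num_tokens):
    """Attention for one query vector, gathering KV via the page table."""
    # Gather this sequence's blocks from wherever they live in the pool,
    # then trim to the tokens actually written (the last block may be partial).
    k = k_cache[block_table].reshape(-1, HEAD_DIM)[:num_tokens]
    v = v_cache[block_table].reshape(-1, HEAD_DIM)[:num_tokens]
    scores = q @ k.T / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

# Two requests of very different lengths share one forward pass; each
# brings its own non-contiguous set of physical blocks.
short_req = ([7, 42], 20)            # 20 tokens in blocks 7 and 42
long_req = ([3, 91, 15, 200], 60)    # 60 tokens across four blocks
q = np.random.randn(HEAD_DIM).astype(np.float32)
for table, n in (short_req, long_req):
    out = paged_attention(q, np.array(table), n)
    print(out.shape)                 # (64,)
```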