Paged Attention
Paged attention manages the KV cache in fixed-size blocks accessed through a page table, borrowing the idea from OS virtual memory (Kwon et al., 2023). It is the main contribution of vLLM, alongside continuous batching.
What's wrong with contiguous KV?
Traditional inference pre-allocates a contiguous KV buffer per request, sized to the worst-case sequence length. Short requests waste the unused tail, requests that outgrow their buffer must be copied to a larger one, and mixed-length batches fragment GPU memory badly.
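To make the tail waste concrete, here is a back-of-the-envelope calculation in Python. The model dimensions and lengths are illustrative assumptions, not measurements from any particular system:

```python
# Illustrative only: assumed model dimensions, not measured from a real system.
num_layers = 32          # transformer layers
num_kv_heads = 32        # KV heads per layer
head_dim = 128           # dimension per head
bytes_per_elem = 2       # fp16

# KV bytes per token = 2 (K and V) * layers * heads * head_dim * bytes
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 512 KiB

max_len = 4096           # worst-case length the server pre-allocates for
actual_len = 300         # tokens the request actually produced

reserved = max_len * kv_bytes_per_token / 2**30     # GiB held for the request
used = actual_len * kv_bytes_per_token / 2**30      # GiB actually written
print(f"reserved {reserved:.2f} GiB, used {used:.2f} GiB, "
      f"wasted {100 * (1 - actual_len / max_len):.0f}%")
# reserved 2.00 GiB, used 0.15 GiB, wasted 93%
```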
How paging fixes it
- Split each sequence’s KV cache into fixed-size blocks (e.g. 16 tokens per block)
- Each sequence has a page table mapping logical positions to physical blocks (see the sketch after this list)
- Blocks live anywhere in GPU memory, not contiguously
- Blocks are allocated on demand as sequences grow
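A minimal sketch of that bookkeeping in plain Python. `BlockAllocator`, `Sequence`, and `block_table` are names invented here for illustration; vLLM's real allocator manages actual GPU memory, swapping, and preemption, all of which this omits:

```python
BLOCK_SIZE = 16  # tokens per block, matching the example above

class BlockAllocator:
    """Hands out fixed-size physical block IDs from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks; preempt or swap a sequence")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Per-sequence page table: logical block index -> physical block ID."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so a sequence never reserves more than one partially used block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def slot(self, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # three physical block IDs, possibly scattered
print(seq.slot(39))          # (block_table[2], 7)
```

Note the on-demand allocation: a sequence's over-reservation is bounded by one partially filled block, regardless of how long it grows.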
Outcome: near-zero fragmentation, much larger effective batch sizes, and prefix sharing across sequences (copy-on-write for branching; direct sharing for identical prefixes, which powers prompt caching).
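Sharing falls out of the same structure once blocks carry reference counts. The sketch below extends the `BlockAllocator` and `Sequence` classes from above; the refcount field and `fork` helper are our illustration, not vLLM's API:

```python
class RefCountedAllocator(BlockAllocator):
    """Adds reference counts so multiple sequences can share blocks."""
    def __init__(self, num_blocks: int):
        super().__init__(num_blocks)
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block_id = super().alloc()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        self.refcount[block_id] += 1

    def release(self, block_id: int) -> None:
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            super().release(block_id)  # back to the free pool

def fork(parent: Sequence) -> Sequence:
    """Branch a sequence: the child initially shares every parent block."""
    child = Sequence(parent.allocator)
    child.block_table = list(parent.block_table)
    child.num_tokens = parent.num_tokens
    for block_id in child.block_table:
        parent.allocator.share(block_id)
    return child

# Copy-on-write: before writing to a block with refcount > 1, allocate a
# fresh block, copy its contents, and update only the writer's page table
# entry; the sibling keeps reading the original block.
```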
Why it enables continuous batching
Continuous batching wants to mix requests of wildly different lengths in one forward pass. Contiguous KV can’t handle that without copying. Page-based KV just hands each request a different set of physical blocks and uses the page table to stitch them together during attention.
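A NumPy sketch of that stitching for a single attention head. A real kernel (vLLM's PagedAttention CUDA kernel, for instance) fuses the gather into the attention computation instead of materializing `k` and `v`; the names, shapes, and block IDs here are assumptions for illustration:

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64
NUM_BLOCKS = 256

# One global pool of KV blocks shared by every request in the batch.
k_cache = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)
v_cache = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)

def paged_attention(q, block_table, num_tokens):
    """Attention for one query vector, gathering KV via the page table."""
    # Gather this sequence's blocks from wherever they live in the pool,
    # then trim to the tokens actually written (the last block may be partial).
    k = k_cache[block_table].reshape(-1, HEAD_DIM)[:num_tokens]
    v = v_cache[block_table].reshape(-1, HEAD_DIM)[:num_tokens]
    scores = q @ k.T / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

# Two requests of very different lengths share one forward pass; each
# brings its own non-contiguous set of physical blocks.
short_req = ([7, 42], 20)            # 20 tokens in blocks 7 and 42
long_req = ([3, 91, 15, 200], 60)    # 60 tokens across four blocks
q = np.random.randn(HEAD_DIM).astype(np.float32)
for table, n in (short_req, long_req):
    out = paged_attention(q, np.array(table), n)
    print(out.shape)                 # (64,)
```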