Attention (Transformer)

Paged Attention

Main contribution of vLLM.

Traditional LLM inference maintains a contiguous Key-Value (KV) cache in GPU memory, which is inefficient as the sequence length grows: memory must be reserved up front for the maximum possible sequence length, so short or variable-length sequences waste much of their allocation and the cache suffers from fragmentation.
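
As a rough illustration of the waste, per-sequence KV cache size scales linearly with the reserved length. The model dimensions below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Rough KV-cache sizing sketch; the model dimensions are illustrative,
# not taken from any specific model.
num_layers = 32
num_heads = 32
head_dim = 128
dtype_bytes = 2  # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # 2x for keys and values, per layer, per head, per token.
    return 2 * num_layers * num_heads * head_dim * dtype_bytes * seq_len

max_seq_len = 2048   # contiguous allocation reserves this much
actual_len = 300     # but the sequence only used this much
reserved = kv_cache_bytes(max_seq_len)
used = kv_cache_bytes(actual_len)
print(f"reserved: {reserved / 2**20:.0f} MiB, used: {used / 2**20:.0f} MiB "
      f"({100 * (1 - used / reserved):.0f}% wasted)")
```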

Paper:

Other

PagedAttention optimizes memory use by partitioning the KV cache into fixed-size blocks that are accessed through a lookup table. The KV cache therefore does not need to live in contiguous memory, and blocks are allocated only as a sequence grows. The improved memory efficiency reduces fragmentation, which raises GPU utilization on memory-bound workloads and lets more sequences be batched together.
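
A minimal sketch of the block-table idea, assuming a fixed block size in tokens; the class and method names here are illustrative, not vLLM's actual API:

```python
# Minimal paged KV-cache sketch. Names and structure are illustrative,
# not the real vLLM implementation.
class PagedKVCache:
    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size                      # tokens per block
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}                            # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> int:
        """Return the physical block holding the new token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:      # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        return table[-1]

    def lookup(self, seq_id: int, token_idx: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset) for attention reads."""
        table = self.block_tables[seq_id]
        return table[token_idx // self.block_size], token_idx % self.block_size

    def free(self, seq_id: int) -> None:
        """Release all blocks of a finished sequence back to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because the block table translates logical token positions to physical blocks, a sequence can grow without relocating its existing KV entries, and finished sequences return their blocks to a shared pool for reuse by other requests.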