PagedAttention
Main contribution of vLLM.
Traditional LLM inference maintains a contiguous Key-Value (KV) cache in GPU memory, which becomes inefficient as the sequence length grows: memory must be reserved up front for the longest possible sequence, and the resulting over-reservation and fragmentation waste a large share of the cache.
Paper:
Other
PagedAttention optimizes memory use by partitioning the KV cache into fixed-size blocks that are accessed through a per-sequence block table (a lookup table). The KV cache therefore does not need to live in contiguous memory, and physical blocks are allocated on demand as a sequence grows. This memory efficiency increases GPU utilization on memory-bound workloads, so more requests can be batched together during inference.
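As a rough illustration, here is a minimal Python sketch of the idea, not vLLM's actual implementation (the real block manager and attention kernels live in vLLM's C++/CUDA code): each sequence keeps a block table mapping logical block indices to physical blocks drawn from a shared pool, and a new physical block is allocated only when the previous one fills. The names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (hypothetical; vLLM also defaults to 16)


@dataclass
class BlockAllocator:
    """Shared pool of fixed-size physical KV-cache blocks."""
    num_blocks: int
    free_blocks: list = field(default_factory=list)

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def allocate(self) -> int:
        # Hand out any free physical block; no contiguity required.
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple[int, int]:
        # The attention kernel would resolve a token position to
        # (physical block id, offset within block) via the block table.
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Example use of the sketch:

```python
allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
print(seq.block_table)        # two physical blocks cover 20 tokens
print(seq.physical_slot(19))  # (second block's id, offset 3)
```

The point of the indirection is that per-sequence memory waste is bounded by at most one partially filled block, and freed blocks return to the shared pool for any other request to reuse.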