Continuous Batching

Continuous batching serves multiple LLM requests through the same forward pass, batching at the token level every iteration rather than locking a batch of full sequences together. Introduced as iteration-level scheduling in Orca (Yu et al., 2022), it is one of the two core techniques behind vLLM's throughput, alongside PagedAttention.

What does static batching get wrong?

Standard batching forms a batch, runs it to completion, then forms the next. Fast requests wait for the slowest one to finish, the GPU sits partially idle as sequences complete, and new requests queue behind the whole batch.
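A toy simulation (not vLLM code) makes the waste concrete: count GPU slot-steps, where a static batch holds every slot until its slowest member finishes.

```python
def static_batch_cost(lengths, batch_size):
    """Slot-steps consumed when each batch runs to completion."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)  # every slot held until the slowest finishes
    return total

def useful_work(lengths):
    """Slot-steps actually spent producing tokens."""
    return sum(lengths)

lengths = [3, 12, 5, 8]                            # decode steps per request
cost = static_batch_cost(lengths, batch_size=4)    # 12 * 4 = 48 slot-steps
waste = cost - useful_work(lengths)                # 48 - 28 = 20 idle slot-steps
```

Here the 12-step request pins three finished requests' slots idle; continuous batching would hand those slots to waiting requests instead.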

How it works

  • Every forward step, the active batch is whatever tokens are currently in-flight across all ongoing requests
  • Finished requests drop out mid-batch and free their slot immediately
  • New requests slot in the next step, no waiting for a batch window

step t:    [req A tok 5] [req B tok 12] [req C tok 3]
step t+1:  [req A tok 6] [req B done  ] [req C tok 4] [req D tok 1]  ← D joins
step t+2:  [req A tok 7] [req C tok 5] [req D tok 2]                 ← B's slot freed
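The schedule above can be sketched as a toy iteration-level loop (hypothetical request names and a fixed slot budget; this is an illustration, not vLLM's scheduler API):

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """requests: dict name -> decode steps needed. Returns per-step batch composition."""
    waiting = deque(requests.items())
    active = {}        # name -> remaining decode steps
    trace = []
    while waiting or active:
        # admit waiting requests into freed slots before each forward step
        while waiting and len(active) < max_slots:
            name, steps = waiting.popleft()
            active[name] = steps
        trace.append(sorted(active))       # one forward pass over these requests
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]           # finished request frees its slot immediately
    return trace

trace = continuous_batching({"A": 3, "B": 1, "C": 2, "D": 2}, max_slots=3)
# → [['A', 'B', 'C'], ['A', 'C', 'D'], ['A', 'D']]: D joins the step after B finishes
```

Note that D is admitted at the very next step after B completes; a static batcher would make D wait until A, B, and C had all finished.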

Why it needs paged attention

Variable sequence lengths across in-flight requests mean the KV caches are all different sizes and grow at different rates. Classic contiguous KV storage fragments badly under this churn. Paged attention lets each sequence's KV cache live in scattered fixed-size blocks, so the scheduler can mix requests freely.
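A minimal block-table allocator in the spirit of paged attention (block size and names are assumptions for illustration, not vLLM's actual data structures): each sequence grabs fixed-size blocks from a shared pool as it grows, and returns them the moment it finishes.

```python
BLOCK_SIZE = 4  # tokens per KV block (assumed for this sketch)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq id -> list of (non-contiguous) block ids
        self.lengths = {}                    # seq id -> tokens written so far

    def append_token(self, seq):
        n = self.lengths.get(seq, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token): grab one
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def free_seq(self, seq):
        self.free.extend(self.tables.pop(seq))  # blocks return to the pool immediately
        del self.lengths[seq]

alloc = BlockAllocator(num_blocks=8)
for _ in range(5):
    alloc.append_token("A")  # 5 tokens -> 2 blocks (second block only 1/4 used)
alloc.append_token("B")      # B's block need not be adjacent to A's
alloc.free_seq("A")          # A's blocks are instantly reusable by new requests
```

Because blocks are fixed-size and position-independent, a finished request's memory can be handed to a newly admitted one without moving anyone else's cache, which is exactly what the continuous-batching scheduler needs.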

Impact

On inference-heavy workloads, continuous batching plus paged attention gives vLLM its headline throughput of up to ~24× over naive HuggingFace Transformers serving (Kwon et al., 2023).