Continuous Batching

Continuous batching serves multiple LLM requests through the same forward pass, batching at the token level every iteration rather than locking a batch of full sequences together. Introduced as iteration-level scheduling in Orca (Yu et al., 2022), it is one of the two core techniques behind vLLM's throughput, alongside PagedAttention.

What does static batching get wrong?

Standard batching forms a batch, runs it to completion, then forms the next. Fast requests wait for the slowest one to finish, the GPU sits partially idle as sequences complete, and new requests queue behind the whole batch.
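A toy simulation (not vLLM code) makes the waste concrete: count GPU slot-steps, where a static batch holds every slot until its slowest member finishes.

```python
def static_batch_cost(lengths, batch_size):
    """Slot-steps consumed when each batch runs to completion."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)  # every slot held until the slowest finishes
    return total

def useful_work(lengths):
    """Slot-steps actually spent producing tokens."""
    return sum(lengths)

lengths = [3, 12, 5, 8]                            # decode steps per request
cost = static_batch_cost(lengths, batch_size=4)    # 12 * 4 = 48 slot-steps
waste = cost - useful_work(lengths)                # 48 - 28 = 20 idle slot-steps
```

Here the 12-step request pins three finished requests' slots idle; continuous batching would hand those slots to waiting requests instead.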

How it works

  • Every forward step, the active batch is whatever tokens are currently in-flight across all ongoing requests
  • Finished requests drop out mid-batch and free their slot immediately
  • New requests slot in the next step, no waiting for a batch window

step t:    [req A tok 5] [req B tok 12] [req C tok 3]
step t+1:  [req A tok 6] [req B done  ] [req C tok 4] [req D tok 1]  ← D joins
step t+2:  [req A tok 7] [req C tok 5] [req D tok 2]                 ← B's slot freed
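The schedule above can be sketched as a toy iteration-level loop (hypothetical request names and a fixed slot budget; this is an illustration, not vLLM's scheduler API):

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """requests: dict name -> decode steps needed. Returns per-step batch composition."""
    waiting = deque(requests.items())
    active = {}        # name -> remaining decode steps
    trace = []
    while waiting or active:
        # admit waiting requests into freed slots before each forward step
        while waiting and len(active) < max_slots:
            name, steps = waiting.popleft()
            active[name] = steps
        trace.append(sorted(active))       # one forward pass over these requests
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]           # finished request frees its slot immediately
    return trace

trace = continuous_batching({"A": 3, "B": 1, "C": 2, "D": 2}, max_slots=3)
# → [['A', 'B', 'C'], ['A', 'C', 'D'], ['A', 'D']]: D joins the step after B finishes
```

Note that D is admitted at the very next step after B completes; a static batcher would make D wait until A, B, and C had all finished.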

Why it needs paged attention

Variable sequence lengths across in-flight requests mean the KV caches are all different sizes and grow at different rates. Classic contiguous KV storage fragments badly under this churn. Paged attention lets each sequence's KV cache live in scattered fixed-size blocks, so the scheduler can mix requests freely.
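A minimal block-table allocator in the spirit of paged attention (block size and names are assumptions for illustration, not vLLM's actual data structures): each sequence grabs fixed-size blocks from a shared pool as it grows, and returns them the moment it finishes.

```python
BLOCK_SIZE = 4  # tokens per KV block (assumed for this sketch)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq id -> list of (non-contiguous) block ids
        self.lengths = {}                    # seq id -> tokens written so far

    def append_token(self, seq):
        n = self.lengths.get(seq, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token): grab one
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def free_seq(self, seq):
        self.free.extend(self.tables.pop(seq))  # blocks return to the pool immediately
        del self.lengths[seq]

alloc = BlockAllocator(num_blocks=8)
for _ in range(5):
    alloc.append_token("A")  # 5 tokens -> 2 blocks (second block only 1/4 used)
alloc.append_token("B")      # B's block need not be adjacent to A's
alloc.free_seq("A")          # A's blocks are instantly reusable by new requests
```

Because blocks are fixed-size and position-independent, a finished request's memory can be handed to a newly admitted one without moving anyone else's cache, which is exactly what the continuous-batching scheduler needs.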

Impact

On inference-heavy workloads, continuous batching plus paged attention gives vLLM its headline throughput of up to ~24× over naive HuggingFace Transformers serving (Kwon et al., 2023).