Gradient Accumulation
Gradient accumulation simulates a large batch by summing gradients over multiple small batches before calling optimizer.step().
Why not just use a bigger batch?
The largest batch that fits in GPU memory is often smaller than the batch size that trains well. Accumulation lets you decouple the two: small micro-batches for memory, large effective batch for optimization.
Mechanics
```python
for i, micro_batch in enumerate(loader):
    # Scale the loss so the accumulated gradient is an average, not a sum.
    loss = model(micro_batch) / accum_steps
    loss.backward()  # backward() adds into each parameter's .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()  # clear .grad before the next accumulation window
```

Effective batch size = micro_batch × accum_steps, but peak memory holds only one micro-batch.
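The arithmetic behind the trick can be checked directly: because backward() sums into .grad, accumulating the gradients of accum_steps loss-scaled micro-batches reproduces the gradient of the mean loss over the full batch. A minimal NumPy sketch with a hypothetical linear least-squares loss (not the BERT model used later):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one shot.
full = grad(X, y, w)

# Same gradient accumulated over 4 micro-batches of 8,
# mirroring the `loss / accum_steps` scaling in the loop above.
accum_steps = 4
acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    acc += grad(Xb, yb, w) / accum_steps

assert np.allclose(full, acc)
```

The equivalence is exact for losses that are means over examples; it is why the loop divides each micro-batch loss by accum_steps before calling backward().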
Yelp benchmark (ECE459 L23)
Training BERT-base on Yelp with a physical batch size of 8 and varying accumulation steps:
| Grad accum | Time (s) | Samples/s | Final accuracy |
|---|---|---|---|
| 1 | 538.37 | 5.56 | 0.621 |
| 8 | 501.89 | 5.98 | 0.554 |
| 32 | 429.70 | 6.98 | 0.347 |
| 1024 | 513.17 | 5.85 | 0.222 |
Throughput peaks around accum=32 and then falls back, while accuracy degrades steadily as the effective batch grows past the sweet spot.
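For scale, the effective batch size behind each row follows directly from physical batch × accumulation steps:

```python
physical_batch = 8  # per the benchmark above
effective = {accum: physical_batch * accum for accum in (1, 8, 32, 1024)}
# {1: 8, 8: 64, 32: 256, 1024: 8192}
```

At accum=1024 each optimizer step averages over 8192 samples, far beyond the batch sizes that typically work for fine-tuning, which is consistent with the accuracy collapse in the last row.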
"Throughput goes up" is not enough
Without a validation signal you can silently destroy accuracy while chasing samples/sec. There is no universally correct batch size: too small gives noisy gradient estimates and may underfit; too large tends to converge to sharp minima that generalize poorly.
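The table's own numbers make the point concrete: picking the accumulation setting by throughput and picking it by accuracy give different winners. A sketch, with the tuples (samples/s, final accuracy) copied from the benchmark above:

```python
results = {  # accum_steps: (samples_per_s, final_accuracy), from the Yelp table
    1: (5.56, 0.621),
    8: (5.98, 0.554),
    32: (6.98, 0.347),
    1024: (5.85, 0.222),
}

by_throughput = max(results, key=lambda k: results[k][0])  # 32: fastest
by_accuracy = max(results, key=lambda k: results[k][1])    # 1: most accurate
```

Optimizing for samples/sec alone would select accum=32 and silently give up almost half the accuracy of accum=1.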