Gradient Accumulation

Gradient accumulation simulates a large batch by summing gradients over multiple small batches before calling optimizer.step().

Why not just use a bigger batch?

The largest batch that fits in GPU memory is often smaller than the batch size that trains well. Accumulation lets you decouple the two: small micro-batches for memory, large effective batch for optimization.

Mechanics

optimizer.zero_grad()                      # start from clean gradients
for i, micro_batch in enumerate(loader):
    loss = model(micro_batch) / accum_steps  # scale so summed grads average over the effective batch
    loss.backward()                          # accumulates into each param's .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Effective batch size = micro_batch × accum_steps, but peak memory is only one micro-batch.
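The loss scaling above is what makes accumulation exact: summing micro-batch gradients each divided by accum_steps reproduces the full-batch mean gradient. A minimal sketch of this equivalence, using a hand-derived MSE gradient for a one-parameter linear model (all names here, w, xs, ys, grad_mse, are illustrative, not from the lecture):

```python
def grad_mse(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

# Full batch of 4 samples in one pass.
full = grad_mse(w, xs, ys)

# Two micro-batches of 2; each gradient divided by accum_steps,
# then summed -- mirroring loss / accum_steps followed by backward().
accum_steps = 2
acc = 0.0
for k in range(accum_steps):
    acc += grad_mse(w, xs[2*k:2*k+2], ys[2*k:2*k+2]) / accum_steps

assert abs(full - acc) < 1e-9  # the two gradients match
```

The same algebra holds for any loss that averages over the batch, which is why the per-micro-batch division by accum_steps matters.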

Yelp benchmark (ECE459 L23)

Training BERT-base on Yelp with physical batch=8, varying accumulation:

Grad accum | Time (s) | Samples/s | Final accuracy
---------- | -------- | --------- | --------------
1          | 538.37   | 5.56      | 0.621
8          | 501.89   | 5.98      | 0.554
32         | 429.70   | 6.98      | 0.347
1024       | 513.17   | 5.85      | 0.222

Throughput peaks around accum=32 and drops again at 1024, while accuracy collapses monotonically as the effective batch grows past the sweet spot.
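For context, the effective batch sizes behind these rows follow directly from the physical batch of 8 stated above:

```python
# Effective batch = physical batch x accumulation steps,
# for the four Yelp runs in the table.
physical = 8
effective = {accum: physical * accum for accum in (1, 8, 32, 1024)}
print(effective)  # {1: 8, 8: 64, 32: 256, 1024: 8192}
```

An effective batch of 8192 on a dataset-scale fine-tuning run means very few optimizer steps per epoch, which is consistent with the accuracy collapse in the last row.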

"Throughput goes up" is not enough

Without a validation signal you can silently destroy accuracy while chasing samples/sec. There is no universally correct batch size: too small gives noisy gradients and underfits; too large takes too few optimizer steps and tends to converge to sharp minima that generalize poorly.