Gradient Accumulation
Gradient accumulation simulates a large batch by summing gradients over multiple small batches before calling optimizer.step().
Why not just use a bigger batch?
The largest batch that fits in GPU memory is often smaller than the batch size that trains well. Accumulation lets you decouple the two: small micro-batches for memory, large effective batch for optimization.
Mechanics
```python
for i, micro_batch in enumerate(loader):
    # Scale the loss so the accumulated gradient is an average, not a sum.
    loss = model(micro_batch) / accum_steps
    loss.backward()  # backward() adds into each parameter's .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()  # clear .grad before the next accumulation window
```

Effective batch size = micro_batch × accum_steps, but peak memory holds only one micro-batch.
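The arithmetic behind the trick can be checked directly: because backward() sums into .grad, accumulating the gradients of accum_steps loss-scaled micro-batches reproduces the gradient of the mean loss over the full batch. A minimal NumPy sketch with a hypothetical linear least-squares loss (not the BERT model used later):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one shot.
full = grad(X, y, w)

# Same gradient accumulated over 4 micro-batches of 8,
# mirroring the `loss / accum_steps` scaling in the loop above.
accum_steps = 4
acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    acc += grad(Xb, yb, w) / accum_steps

assert np.allclose(full, acc)
```

The equivalence is exact for losses that are means over examples; it is why the loop divides each micro-batch loss by accum_steps before calling backward().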
Yelp benchmark (ECE459 L23)
Training BERT-base on Yelp with a physical batch size of 8 and varying accumulation steps:
| Grad accum | Time (s) | Samples/s | Final accuracy |
|---|---|---|---|
| 1 | 538.37 | 5.56 | 0.621 |
| 8 | 501.89 | 5.98 | 0.554 |
| 32 | 429.70 | 6.98 | 0.347 |
| 1024 | 513.17 | 5.85 | 0.222 |
Throughput peaks around accum=32 and then falls back, while accuracy degrades steadily as the effective batch grows past the sweet spot.
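For scale, the effective batch size behind each row follows directly from physical batch × accumulation steps:

```python
physical_batch = 8  # per the benchmark above
effective = {accum: physical_batch * accum for accum in (1, 8, 32, 1024)}
# {1: 8, 8: 64, 32: 256, 1024: 8192}
```

At accum=1024 each optimizer step averages over 8192 samples, far beyond the batch sizes that typically work for fine-tuning, which is consistent with the accuracy collapse in the last row.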
"Throughput goes up" is not enough
Without a validation signal you can silently destroy accuracy while chasing samples/sec. There is no universally correct batch size: too small gives noisy gradient estimates and may underfit; too large tends to converge to sharp minima that generalize poorly.
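The table's own numbers make the point concrete: picking the accumulation setting by throughput and picking it by accuracy give different winners. A sketch, with the tuples (samples/s, final accuracy) copied from the benchmark above:

```python
results = {  # accum_steps: (samples_per_s, final_accuracy), from the Yelp table
    1: (5.56, 0.621),
    8: (5.98, 0.554),
    32: (6.98, 0.347),
    1024: (5.85, 0.222),
}

by_throughput = max(results, key=lambda k: results[k][0])  # 32: fastest
by_accuracy = max(results, key=lambda k: results[k][1])    # 1: most accurate
```

Optimizing for samples/sec alone would select accum=32 and silently give up almost half the accuracy of accum=1.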