Weight Decay
Weight decay is a regularization technique that shrinks the weights toward zero by a multiplicative factor at each gradient step, equivalent to L2 regularization under plain SGD.
Decay as a general idea
Decay here refers to the weights themselves; there is also learning rate decay in neural network training, see Learning Rate.
L2 Equivalence Derivation
Start with an L2-regularized loss:

$$L_{\text{reg}}(w) = L(w) + \frac{\lambda}{2}\|w\|^2$$

Take the gradient with respect to $w$:

$$\nabla L_{\text{reg}}(w) = \nabla L(w) + \lambda w$$

One gradient step:

$$w_{t+1} = w_t - \eta\left(\nabla L(w_t) + \lambda w_t\right)$$

Rearrange:

$$w_{t+1} = (1 - \eta\lambda)\,w_t - \eta\,\nabla L(w_t)$$

where:
- $\eta$ is the learning rate
- $\lambda$ is the regularization strength
- $(1 - \eta\lambda)$ is the shrinkage factor applied each step
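The rearrangement can be checked numerically: one SGD step on the L2-regularized loss matches the shrink-then-step form exactly. A minimal sketch with an illustrative toy loss and arbitrary constants:

```python
# Check that SGD on the L2-regularized loss equals the
# decay-then-step update for the same eta and lambda.

def grad_loss(w):
    # Toy loss L(w) = (w - 3)^2, so dL/dw = 2(w - 3).
    return 2.0 * (w - 3.0)

eta, lam = 0.1, 0.01
w = 5.0

# SGD on the regularized loss: gradient is dL/dw + lam * w.
w_l2 = w - eta * (grad_loss(w) + lam * w)

# Weight-decay form: shrink toward zero first, then take the plain step.
w_decay = (1.0 - eta * lam) * w - eta * grad_loss(w)

# The two agree up to floating-point rounding.
```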
Intuition
Every gradient step, first shrink every weight toward zero by a fixed fraction, then apply the usual gradient step. Weights only grow if the loss gradient is strong enough to outrun the steady pull back to zero. That’s why weight decay is the MAP equivalent of a Gaussian prior on weights: small weights are assumed more plausible, and the data has to work to justify any weight being large. It stops any single weight from explaining too much on its own.
Hence “weight decay”: L2 regularization reinterpreted as a shrinkage step.
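The MAP reading can be made explicit: with a Gaussian prior $w \sim \mathcal{N}(0, \sigma^2 I)$ and the loss taken as a negative log-likelihood, maximizing the log-posterior recovers the L2-regularized objective:

$$
\hat{w}_{\text{MAP}} = \arg\max_w \left[\, \log p(D \mid w) + \log p(w) \,\right]
= \arg\min_w \left[\, L(w) + \frac{1}{2\sigma^2}\|w\|^2 \,\right]
$$

so $\lambda = 1/\sigma^2$: a tighter prior (smaller $\sigma$) means stronger decay.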
Equivalence breaks under adaptive optimizers
For Adam / RMSProp, L2 regularization and weight decay are not the same. The adaptive step rescales the gradient by $1/(\sqrt{\hat{v}_t} + \epsilon)$, so the L2 term $\lambda w$ gets rescaled too, dampening regularization on frequently-updated parameters. The weights you most want to regularize (big, fast-moving ones) end up regularized the least, which is the opposite of what you want.
AdamW fixes this by applying decay directly to the weights instead of through the gradient:

$$w_{t+1} = w_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_t\right)$$
This is why modern training uses AdamW, not Adam + L2.
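The difference can be seen on a single scalar weight. A minimal sketch with momentum and bias correction omitted and all constants illustrative:

```python
import math

eta, lam, eps = 0.01, 0.1, 1e-8
w, g, v = 1.0, 2.0, 4.0  # weight, loss gradient, second-moment estimate

# Adam + L2: the decay term lam*w enters the gradient, so it gets
# divided by sqrt(v) too -- a large v weakens the regularization.
g_l2 = g + lam * w
w_adam_l2 = w - eta * g_l2 / (math.sqrt(v) + eps)

# AdamW: decay is applied to the weight directly, untouched by v.
w_adamw = w - eta * (g / (math.sqrt(v) + eps) + lam * w)

# Effective decay applied to w in each case:
decay_adam_l2 = eta * lam * w / (math.sqrt(v) + eps)  # shrinks as v grows
decay_adamw = eta * lam * w                           # independent of v
```

With $v = 4$ the Adam + L2 path halves the intended decay, while AdamW applies it in full regardless of the gradient history.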
Effects
- Prevents weights from growing unboundedly
- Keeps the model in a low-norm region of parameter space, giving a smaller function class and better generalization
- Acts as a Gaussian prior on weights in the MAP interpretation
Hyperparameters:
- $\lambda$: decay strength, typically $10^{-4}$ to $10^{-2}$
- Often excluded from biases and layer-norm parameters
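The exclusion is usually implemented by splitting parameters into two optimizer groups, one with decay and one without. A minimal sketch with hypothetical parameter names; real code would iterate something like a framework's `named_parameters()`:

```python
def split_decay_groups(named_params, weight_decay=0.01):
    """Route biases and norm-layer parameters into a no-decay group."""
    decay, no_decay = [], []
    for name, p in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy example: placeholder values stand in for parameter tensors.
params = [("linear.weight", 1), ("linear.bias", 2), ("layernorm.weight", 3)]
groups = split_decay_groups(params)
```

The returned list has the shape most optimizers accept for per-parameter-group options, so only the weight matrices feel the decay.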
Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec10.pdf