Weight Decay
Weight decay is a regularization technique that shrinks the weights toward zero by a multiplicative factor at each gradient step, equivalent to L2 regularization under plain SGD.
Decay as a general idea
Decay here refers to the weights themselves; there is also learning rate decay in neural network training, see Learning Rate.
L2 Equivalence Derivation
Start with an L2-regularized loss:

$$L_{\text{reg}}(w) = L(w) + \frac{\lambda}{2}\|w\|^2$$

Take the gradient with respect to $w$:

$$\nabla L_{\text{reg}}(w) = \nabla L(w) + \lambda w$$

One gradient step:

$$w_{t+1} = w_t - \eta\left(\nabla L(w_t) + \lambda w_t\right)$$

Rearrange:

$$w_{t+1} = (1 - \eta\lambda)\,w_t - \eta\,\nabla L(w_t)$$

where:
- $\eta$ is the learning rate
- $\lambda$ is the regularization strength
- $(1 - \eta\lambda)$ is the shrinkage factor applied each step
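The rearrangement can be checked numerically: one SGD step on the L2-regularized loss matches the shrink-then-step form exactly. A minimal sketch with an illustrative toy loss and arbitrary constants:

```python
# Check that SGD on the L2-regularized loss equals the
# decay-then-step update for the same eta and lambda.

def grad_loss(w):
    # Toy loss L(w) = (w - 3)^2, so dL/dw = 2(w - 3).
    return 2.0 * (w - 3.0)

eta, lam = 0.1, 0.01
w = 5.0

# SGD on the regularized loss: gradient is dL/dw + lam * w.
w_l2 = w - eta * (grad_loss(w) + lam * w)

# Weight-decay form: shrink toward zero first, then take the plain step.
w_decay = (1.0 - eta * lam) * w - eta * grad_loss(w)

# The two agree up to floating-point rounding.
```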
Intuition
Every gradient step, first shrink every weight toward zero by a fixed fraction, then apply the usual gradient step. Weights only grow if the loss gradient is strong enough to outrun the steady pull back to zero. That’s why weight decay is the MAP equivalent of a Gaussian prior on weights: small weights are assumed more plausible, and the data has to work to justify any weight being large. It stops any single weight from explaining too much on its own.
Hence “weight decay”: L2 regularization reinterpreted as a shrinkage step.
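The MAP reading can be made explicit: with a Gaussian prior $w \sim \mathcal{N}(0, \sigma^2 I)$ and the loss taken as a negative log-likelihood, maximizing the log-posterior recovers the L2-regularized objective:

$$
\hat{w}_{\text{MAP}} = \arg\max_w \left[\, \log p(D \mid w) + \log p(w) \,\right]
= \arg\min_w \left[\, L(w) + \frac{1}{2\sigma^2}\|w\|^2 \,\right]
$$

so $\lambda = 1/\sigma^2$: a tighter prior (smaller $\sigma$) means stronger decay.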
Equivalence breaks under adaptive optimizers
For Adam / RMSProp, L2 regularization and weight decay are not the same. The adaptive step rescales the gradient by $1/(\sqrt{\hat{v}_t} + \epsilon)$, so the L2 term $\lambda w$ gets rescaled too, dampening regularization on frequently-updated parameters. The weights you most want to regularize (big, fast-moving ones) end up regularized the least, which is the opposite of what you want.
AdamW fixes this by applying decay directly to the weights instead of through the gradient:

$$w_{t+1} = w_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_t\right)$$
This is why modern training uses AdamW, not Adam + L2.
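The difference can be seen on a single scalar weight. A minimal sketch with momentum and bias correction omitted and all constants illustrative:

```python
import math

eta, lam, eps = 0.01, 0.1, 1e-8
w, g, v = 1.0, 2.0, 4.0  # weight, loss gradient, second-moment estimate

# Adam + L2: the decay term lam*w enters the gradient, so it gets
# divided by sqrt(v) too -- a large v weakens the regularization.
g_l2 = g + lam * w
w_adam_l2 = w - eta * g_l2 / (math.sqrt(v) + eps)

# AdamW: decay is applied to the weight directly, untouched by v.
w_adamw = w - eta * (g / (math.sqrt(v) + eps) + lam * w)

# Effective decay applied to w in each case:
decay_adam_l2 = eta * lam * w / (math.sqrt(v) + eps)  # shrinks as v grows
decay_adamw = eta * lam * w                           # independent of v
```

With $v = 4$ the Adam + L2 path halves the intended decay, while AdamW applies it in full regardless of the gradient history.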
Effects
- Prevents weights from growing unboundedly
- Keeps the model in a low-norm region of parameter space, giving a smaller function class and better generalization
- Acts as a Gaussian prior on weights in the MAP interpretation
Hyperparameters:
- $\lambda$: decay strength, typically $10^{-4}$ to $10^{-2}$
- Often excluded from biases and layer-norm parameters
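The exclusion is usually implemented by splitting parameters into two optimizer groups, one with decay and one without. A minimal sketch with hypothetical parameter names; real code would iterate something like a framework's `named_parameters()`:

```python
def split_decay_groups(named_params, weight_decay=0.01):
    """Route biases and norm-layer parameters into a no-decay group."""
    decay, no_decay = [], []
    for name, p in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy example: placeholder values stand in for parameter tensors.
params = [("linear.weight", 1), ("linear.bias", 2), ("layernorm.weight", 3)]
groups = split_decay_groups(params)
```

The returned list has the shape most optimizers accept for per-parameter-group options, so only the weight matrices feel the decay.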
Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec10.pdf