Vanishing & Exploding Gradients

The vanishing gradient problem is when gradients shrink toward zero as they’re backpropagated through a deep network, so early layers stop learning. The exploding gradient problem is the mirror image: gradients grow without bound, causing wild weight updates and NaNs.

Why does this happen?

Backprop through layers multiplies Jacobians together. If each one has typical singular value s, the gradient at layer 1 of a depth-L network scales roughly like s^L.

  • s < 1: s^L → 0 (vanish)
  • s > 1: s^L → ∞ (explode)

There is no stable middle: you sit on a knife’s edge unless you actively engineer s ≈ 1.
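A two-line plain-Python sketch of this knife’s edge (the depth of 50 and the per-layer factors 0.9 and 1.1 are arbitrary illustrative choices):

```python
# Geometric compounding: per-layer factors slightly below or above 1
# become exponential effects across depth.
depth = 50
shrink = 0.9 ** depth  # each layer scales the gradient by 0.9
grow = 1.1 ** depth    # each layer scales the gradient by 1.1

print(f"0.9^{depth} = {shrink:.2e}")  # ~5e-3: the gradient has nearly vanished
print(f"1.1^{depth} = {grow:.2e}")    # ~1e+2: the gradient has exploded
```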

The chain-rule intuition

A loss gradient w.r.t. an early-layer weight is a product (writing z_l = W_l a_{l-1}, a_l = f(z_l), and δ_l = ∂L/∂z_l):

∂L/∂W_1 = δ_1 a_0^T,  with  δ_1 = diag(f′(z_1)) W_2^T · diag(f′(z_2)) W_3^T · … · diag(f′(z_{L-1})) W_L^T · δ_L

where:

  • each factor diag(f′(z_l)) W_{l+1}^T is a weight matrix times an activation derivative
  • the final a_0 is just the input to layer 1

Multiplying L − 1 such factors is a geometric process: small numbers shrink exponentially, large numbers grow exponentially. Linear depth, exponential consequences.

Common confusion: "isn't the weight gradient just the input?"

For a layer weight W_l, the gradient of the loss w.r.t. that weight factors as:

∂L/∂W_l = δ_l a_{l-1}^T

where:

  • the local term a_{l-1} really is “just the input”; if activations are bounded, so is this factor
  • the upstream error δ_l is where the explosion lives

The upstream error itself recurses backward through every later layer:

δ_l = diag(f′(z_l)) W_{l+1}^T δ_{l+1}

So the weight matrices compounding into δ_l aren’t W_l itself, they’re all the upstream weights W_{l+1}, …, W_L baked into δ_l by the chain rule. Even if every activation derivative is ≤ 1, upstream spectral norms of ≈ 1.1 across a depth-L chain give ≈ 1.1^L growth. One weight being “a bit larger than 1” compounds across depth.
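A numeric sketch of this compounding (NumPy assumed; the width 64, depth 40, and per-layer gains 0.9 and 1.1 are arbitrary choices; activation derivatives are taken as 1, as in the active region of a ReLU, to isolate the weights):

```python
import numpy as np

def backprop_norm_ratio(gain, depth=40, d=64, seed=0):
    """Push an error vector backward through `depth` random linear layers,
    each scaled so that on average ||W.T @ delta|| ~= gain * ||delta||."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(d)
    start = np.linalg.norm(delta)
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * (gain / np.sqrt(d))
        delta = W.T @ delta  # activation derivative taken as 1 (active ReLU)
    return np.linalg.norm(delta) / start

print(backprop_norm_ratio(0.9))  # well below 1: vanishing
print(backprop_norm_ratio(1.1))  # well above 1: exploding
```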

Large inputs can inflate the local term (and in ReLU nets forward activations themselves grow with depth when weights are big), but the dominant mechanism is the Jacobian product on the backward pass.

From Jacobians to singular values

“Small numbers” and “large numbers” is loose talk: these factors are matrices, not scalars. So what does it mean for a matrix to be “small” or “large”?

The chain rule produces a Jacobian J_l = ∂a_l/∂a_{l-1} at every layer (literally W_l for a linear layer, diag(f′(z_l)) W_l with a nonlinearity). The gradient flowing backwards is:

∂L/∂a_1 = J_2^T J_3^T … J_L^T (∂L/∂a_L)

At every layer the gradient gets multiplied by a matrix. The relevant question is: how much can multiplying by J_l stretch a vector? That maximum stretching factor is exactly ‖J_l‖_2, the spectral norm of J_l, equal to its largest singular value. We get the bound:

‖∂L/∂a_1‖ ≤ σ_max(J_2) σ_max(J_3) … σ_max(J_L) · ‖∂L/∂a_L‖

where:

  • σ_max(J_l) is the largest singular value of the layer-l Jacobian
  • if the typical σ_max is s, the worst-case gradient at layer 1 scales like s^{L-1}

Singular values aren’t a new concept beyond derivatives; they’re just the right way to measure the size of the Jacobian matrices the chain rule already gives you.
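A small NumPy check that the largest singular value really is the maximum stretch factor of a matrix (the 5×5 random Jacobian stand-in is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((5, 5))  # stand-in for a layer Jacobian

sigma_max = np.linalg.svd(J, compute_uv=False)[0]  # spectral norm of J

# No vector is stretched by more than sigma_max...
for _ in range(1000):
    v = rng.standard_normal(5)
    assert np.linalg.norm(J @ v) <= sigma_max * np.linalg.norm(v) + 1e-9

# ...and the top right-singular vector achieves the bound exactly.
_, _, Vt = np.linalg.svd(J)
v_top = Vt[0]
print(np.linalg.norm(J @ v_top), sigma_max)  # equal up to float error
```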

Concrete example: sigmoid

The sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) peaks at 0.25 (when z = 0) and is much smaller elsewhere. In a 10-layer net using sigmoids, even in the best case the activation-derivative product is at most 0.25^10 ≈ 10^-6. The gradient reaching layer 1 is roughly a million times smaller than the gradient at layer 10, so layer 1 effectively does not train.
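The arithmetic, in plain Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(dsigmoid(0.0))  # 0.25, the maximum of the sigmoid derivative
print(0.25 ** 10)     # ~9.5e-7: best-case shrinkage across 10 sigmoid layers
print(dsigmoid(5.0))  # ~0.0066: far worse once a neuron drifts from z = 0
```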

This is the historical reason deep nets were considered “untrainable” before ~2010. It wasn’t compute, it was sigmoid + bad init.

Exploding gradients

Same mechanism, opposite direction. If weight matrices have spectral norm > 1, the product blows up:

  • activations like ReLU don’t cap the forward pass (no saturation on positives), so signals can grow
  • a weight matrix with largest singular value 1.5 gives up to 1.5^L gradient growth across L layers
  • symptom: loss suddenly becomes NaN, or you see gradient norms in the millions in your logs

RNNs are the classic offender

RNNs apply the same weight matrix W at every timestep, so backprop-through-time over T steps multiplies by (essentially) the same matrix T times. The largest eigenvalue magnitude |λ_max| of W controls everything:

  • |λ_max| > 1 → explode
  • |λ_max| < 1 → vanish
  • you essentially never land on exactly |λ_max| = 1
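A NumPy sketch of backprop-through-time with a single recurrent matrix (the width 32 and 100 timesteps are arbitrary; the matrix is rescaled to a chosen largest eigenvalue magnitude):

```python
import numpy as np

def bptt_norm_ratio(eig_mag, steps=100, d=32, seed=0):
    """Multiply an error vector by the same recurrent matrix `steps` times,
    as backprop-through-time does, with the matrix rescaled so its largest
    eigenvalue magnitude equals `eig_mag`."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, d))
    W *= eig_mag / np.max(np.abs(np.linalg.eigvals(W)))
    delta = rng.standard_normal(d)
    start = np.linalg.norm(delta)
    for _ in range(steps):
        delta = W.T @ delta
    return np.linalg.norm(delta) / start

print(bptt_norm_ratio(0.9))  # shrinks toward 0: vanishing
print(bptt_norm_ratio(1.1))  # blows up: exploding
```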

This is why vanilla RNNs can’t learn long-range dependencies and why LSTM’s additive gated cell state was such a breakthrough: the gradient flows through addition, not repeated multiplication.

Saturated neurons (the activation side of vanishing)

A neuron is saturated when its activation derivative is ~0, so no gradient flows through it.

  • sigmoid saturates when |z| is large: the output flattens at 0 or 1 and the derivative → 0
  • tanh saturates the same way (and is zero-centered, so slightly better, but the gradient still dies)
  • ReLU “dead neuron”: once a ReLU is pushed into the negative region by a bad weight update, it outputs 0 forever, its gradient is 0 forever, and that unit never recovers. Often caused by a too-large learning rate or bad init

Dead ReLUs are permanent

There’s no gradient to push a dead unit back out of the zero region, so it stays silent for the rest of training. Permanent brain damage.
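A minimal illustration of a dead unit (the weight, bias, and inputs are made up; the point is that a sufficiently negative bias keeps the pre-activation negative for every plausible input):

```python
# A dead ReLU: once the pre-activation is negative for every input the
# unit sees, both its output and its gradient are exactly 0, forever.
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

w, b = 0.5, -10.0          # bias knocked far negative by a bad update
for x in [0.1, 1.0, 5.0]:  # typical bounded inputs
    z = w * x + b          # always negative
    print(relu(z), relu_grad(z))  # (0.0, 0.0) every time: no learning signal
```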

See Activation Function for the full menu.

How to diagnose

  • log per-layer gradient norms during training; vanishing: norms decay geometrically toward the input; exploding: norms spike or go NaN
  • watch early-layer weight histograms over epochs; if they don’t move, gradients aren’t reaching them
  • loss plateaus immediately at a high value β†’ likely vanishing
  • loss is NaN after a few steps β†’ almost certainly exploding
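A sketch of the first diagnostic: manual backprop through a small sigmoid MLP with per-layer gradient norms logged (the width 16, depth 10, and 1/√fan-in init are illustrative assumptions, not a recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_layers = 16, 10
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 10 sigmoid layers with a simple 1/sqrt(fan_in) init
Ws = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(n_layers)]

# Forward pass, keeping activations for backprop
a = rng.standard_normal(width)
acts = [a]
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# Backward pass, logging the gradient norm entering each layer
delta = np.ones(width)  # stand-in for dL/da at the output
norms = []
for l in range(n_layers, 0, -1):
    dz = delta * acts[l] * (1.0 - acts[l])  # sigmoid'(z) = a * (1 - a)
    delta = Ws[l - 1].T @ dz                # gradient w.r.t. the previous activation
    norms.append(np.linalg.norm(delta))

for l, n in zip(range(n_layers, 0, -1), norms):
    print(f"gradient norm entering layer {l}: {n:.2e}")  # decays geometrically
```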

How modern nets solve it

Each fix targets a different factor in the product:

  • Kaiming init: scales W so each layer’s Jacobian has singular values ≈ 1 at initialization
  • ReLU / GELU / SiLU: activation derivative is 1 (not ≤ 0.25) on the active region, so no per-layer shrinkage
  • BatchNorm / LayerNorm: re-centers activations each layer, preventing drift into saturated regimes
  • Residual connections (y = x + F(x)): gradient flows through the identity path, ∂y/∂x = I + ∂F/∂x, so the worst case is “gradient passes through unchanged” instead of shrinking
  • LSTM gating: cell state has an additive path; the forget gate keeps gradient alive across timesteps
  • Gradient clipping: bandage for exploding gradients only; caps the gradient norm at a threshold, doesn’t help vanishing
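To see the residual identity path at work, compare backprop through plain layers against layers wrapped in a skip connection (NumPy; the width, depth, and 0.5 branch scale are arbitrary choices):

```python
import numpy as np

def backprop(residual, depth=30, d=32, scale=0.5, seed=0):
    """Backprop an error vector through `depth` layers whose branch
    Jacobian J_F is small random noise; with `residual` the full
    Jacobian is I + J_F, without it the branch stands alone."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(d)
    start = np.linalg.norm(delta)
    for _ in range(depth):
        J_F = scale * rng.standard_normal((d, d)) / np.sqrt(d)
        J = np.eye(d) + J_F if residual else J_F
        delta = J.T @ delta
    return np.linalg.norm(delta) / start

print(backprop(residual=False))  # ~0.5^30: vanished
print(backprop(residual=True))   # survives, thanks to the identity path
```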

The pattern

Prevent geometric decay/growth at the source (init, normalization, activation choice, residual paths). Clipping is the only “cleanup at the end” fix and only helps the exploding side.
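A sketch of global-norm clipping in NumPy (the max_norm of 1.0 and the toy gradients are arbitrary; real frameworks ship equivalents such as PyTorch’s clip_grad_norm_):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm; returns the clipped list and the pre-clip norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.full(4, 100.0), np.full(4, -100.0)]  # an "exploded" gradient
clipped, before = clip_by_global_norm(grads, max_norm=1.0)
after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(before, after)  # ~283 before, 1.0 after
```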
