Vanishing & Exploding Gradients
The vanishing gradient problem is when gradients shrink toward zero as they're backpropagated through a deep network, so early layers stop learning. The exploding gradient problem is the mirror image: gradients grow without bound, causing wild weight updates and NaNs.
Why does this happen?
Backprop through $L$ layers multiplies $L$ Jacobians together. If each one has typical singular value $s$, the gradient at layer 1 scales like $s^L$.
- $s < 1$: $s^L \to 0$ (vanish)
- $s > 1$: $s^L \to \infty$ (explode)
There is no stable middle; you sit on a knife's edge unless you actively engineer $s \approx 1$.
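To see the knife's edge numerically, here is a tiny sketch in plain Python (the depth of 50 is an arbitrary illustrative choice):

```python
# Even modest per-layer factors compound into extreme ratios at depth 50.
for s in (0.9, 1.0, 1.1):
    print(f"s = {s}: s^50 = {s ** 50:.3g}")
# s = 0.9: s^50 = 0.00515   (vanishing)
# s = 1.0: s^50 = 1         (the knife's edge)
# s = 1.1: s^50 = 117       (exploding)
```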
The chain-rule intuition
A loss gradient w.r.t. an early-layer weight is a product:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$

where:
- each $\frac{\partial a_l}{\partial a_{l-1}}$ is a weight matrix times the activation derivative
- the final factor $\frac{\partial a_1}{\partial W_1}$ is just the input to layer 1
Multiplying $L$ such factors is a geometric process: small numbers shrink exponentially, large numbers grow exponentially. Linear depth, exponential consequences.
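Here is a minimal NumPy sketch of that geometric process, with random Gaussian matrices standing in for the per-layer Jacobians and a single `gain` knob (an illustrative assumption, not a real network) controlling their typical singular value:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 50

def backprop_norm(gain):
    """Push a gradient vector backward through `depth` random layer Jacobians."""
    g = rng.standard_normal(d)
    for _ in range(depth):
        J = gain * rng.standard_normal((d, d)) / np.sqrt(d)  # per-layer "Jacobian" with typical gain
        g = J.T @ g                                          # one backward step
    return np.linalg.norm(g)

for gain in (0.9, 1.0, 1.1):
    print(f"gain {gain}: ||gradient reaching layer 1|| ~ {backprop_norm(gain):.3g}")
```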
Common confusion: "isn't the weight gradient just the input?"
For a layer $a_l = \sigma(W_l a_{l-1})$ with pre-activation $z_l = W_l a_{l-1}$, the gradient of the loss w.r.t. that weight factors as:

$$\frac{\partial \mathcal{L}}{\partial W_l} = \delta_l \, a_{l-1}^\top, \qquad \delta_l = \frac{\partial \mathcal{L}}{\partial z_l}$$

where:
- the local term $a_{l-1}$ really is "just the input"; if activations are bounded, so is this factor
- the upstream error $\delta_l$ is where the explosion lives
The upstream error itself recurses backward through every later layer:

$$\delta_l = \mathrm{diag}\!\left(\sigma'(z_l)\right) W_{l+1}^\top \, \delta_{l+1}$$

So the weight matrices compounding into $\delta_l$ aren't $W_l$ itself; they're all the upstream weights $W_{l+1}, \dots, W_L$ baked into $\delta_l$ by the chain rule. Even if every activation derivative is $\le 1$ and every $\|W_k\|$ is only a little above 1, a depth-$L$ chain gives exponential growth. One weight being "a bit larger than 1" compounds across depth.
Large inputs can inflate the local term (and in ReLU nets forward activations themselves grow with depth when weights are big), but the dominant mechanism is the Jacobian product on the backward pass.
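A quick autograd check of that factorization (PyTorch; the sigmoid layer and the squared-sum loss are placeholder choices standing in for "everything downstream"):

```python
import torch

torch.manual_seed(0)
d = 5
x = torch.randn(d)                         # a_{l-1}: the input to this layer
W = torch.randn(d, d, requires_grad=True)  # W_l

z = W @ x                                  # pre-activation z_l
a = torch.sigmoid(z)                       # a_l
loss = (a ** 2).sum()                      # stand-in for everything downstream

# delta_l = dL/dz_l: the upstream error arriving at this layer
delta = torch.autograd.grad(loss, z, retain_graph=True)[0]
loss.backward()

# dL/dW_l factors as the outer product of the upstream error and the input
print(torch.allclose(W.grad, torch.outer(delta, x)))  # True
```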
From Jacobians to singular values
"Small numbers" and "large numbers" is loose talk: these factors are matrices, not scalars. So what does it mean for a matrix to be "small" or "large"?
The chain rule produces a Jacobian $J_l = \frac{\partial a_l}{\partial a_{l-1}}$ at every layer (literally $W_l$ for a linear layer, $\mathrm{diag}(\sigma'(z_l))\,W_l$ with a nonlinearity). The gradient flowing backwards is:

$$\frac{\partial \mathcal{L}}{\partial a_1} = J_2^\top J_3^\top \cdots J_L^\top \, \frac{\partial \mathcal{L}}{\partial a_L}$$

At every layer the gradient gets multiplied by a matrix. The relevant question is: how much can multiplying by $J_l$ stretch a vector? That maximum stretching factor is exactly $\|J_l\|_2$, the spectral norm of $J_l$, equal to its largest singular value $\sigma_{\max}(J_l)$. We get the bound:

$$\left\| \frac{\partial \mathcal{L}}{\partial a_1} \right\| \;\le\; \left( \prod_{l=2}^{L} \sigma_{\max}(J_l) \right) \left\| \frac{\partial \mathcal{L}}{\partial a_L} \right\|$$

where:
- $\sigma_{\max}(J_l)$ is the largest singular value of the layer-$l$ Jacobian
- if the typical $\sigma_{\max}$ is $s$, the worst-case gradient at layer 1 scales like $s^L$
Singular values aren't a new concept beyond derivatives; they're just the right way to measure the size of the Jacobian matrices the chain rule already gives you.
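A sketch of that bound with random matrices standing in for the Jacobians (NumPy; in a real net they would come from autograd):

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth = 32, 20

# Random matrices standing in for the layer Jacobians.
Js = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

g = rng.standard_normal(d)      # gradient arriving at the top layer
bound = np.linalg.norm(g)
for J in reversed(Js):
    g = J.T @ g                            # backward pass through one layer
    bound *= np.linalg.norm(J, 2)          # spectral norm = largest singular value

print(f"actual ||gradient at layer 1||:    {np.linalg.norm(g):.3g}")
print(f"bound (product of spectral norms): {bound:.3g}")
assert np.linalg.norm(g) <= bound
```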
Concrete example: sigmoid
The sigmoid derivative peaks at 0.25 (when $x = 0$) and is much smaller elsewhere. In a 10-layer net using sigmoids, even in the best case the activation-derivative product is at most $0.25^{10} \approx 10^{-6}$. The gradient reaching layer 1 is roughly a million times smaller than the gradient at layer 10, so layer 1 effectively does not train.
This is the historical reason deep nets were considered "untrainable" before ~2010. It wasn't compute, it was sigmoid + bad init.
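A sketch that builds such a net and prints each layer's weight-gradient norm after one backward pass (PyTorch, default init; the exact ratio depends on the initialization, but the geometric decay toward layer 1 is the point):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 10, 64

# A 10-layer sigmoid MLP, roughly the pre-2010 recipe described above.
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(8, width)
net(x).pow(2).mean().backward()

# Weight-gradient norm per layer; layer 1 is closest to the input.
linears = [m for m in net if isinstance(m, nn.Linear)]
for i, m in enumerate(linears, start=1):
    print(f"layer {i:2d}: ||dL/dW|| = {m.weight.grad.norm().item():.2e}")
```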
Exploding gradients
Same mechanism, opposite direction. If weight matrices have spectral norm $> 1$, the product blows up:
- activations like ReLU don't cap the forward pass (no saturation on positives), so signals can grow
- a weight matrix with largest singular value $s > 1$ gives $s^L$ gradient growth
- symptom: loss suddenly becomes NaN, or you see gradient norms in the millions in your logs
RNNs are the classic offender
RNNs apply the same weight matrix $W$ at every timestep, so backprop-through-time multiplies by $W^\top$ (times an activation derivative) $T$ times for a sequence of length $T$. The largest eigenvalue of $W$ controls everything:
- $|\lambda_{\max}| > 1$ → explode
- $|\lambda_{\max}| < 1$ → vanish
- you essentially never land on exactly $|\lambda_{\max}| = 1$
This is why vanilla RNNs can't learn long-range dependencies and why LSTM's additive gated cell state was such a breakthrough: the gradient flows through addition, not repeated multiplication.
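A sketch of why the spectral radius of the recurrent matrix controls BPTT, using a linear RNN with the nonlinearity and inputs stripped out to isolate the repeated multiplication (NumPy; the dimensions and sequence length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 32, 100

def bptt_norm(radius):
    """Backprop a gradient through `steps` reuses of the same recurrent matrix (no gating)."""
    W = rng.standard_normal((d, d))
    W *= radius / max(abs(np.linalg.eigvals(W)))   # rescale to the target spectral radius
    g = rng.standard_normal(d)
    for _ in range(steps):
        g = W.T @ g                                # one step of backprop-through-time
    return np.linalg.norm(g)

for rho in (0.9, 1.1):
    print(f"spectral radius {rho}: ||grad after {steps} steps|| = {bptt_norm(rho):.3g}")
```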
Saturated neurons (the activation side of vanishing)
A neuron is saturated when its activation derivative is ~0, so no gradient flows through it.
- sigmoid saturates when $|x|$ is large: output flattens at 0 or 1, derivative ≈ 0
- tanh saturates the same way (and is zero-centered, so slightly better, but the gradient still dies)
- ReLU "dead neuron": once a ReLU is pushed into the negative region by a bad weight update, it outputs 0 forever, gradient is 0 forever, and that unit never recovers. Often caused by a too-large learning rate or bad init
Dead ReLUs are permanent
There's no gradient to push a dead unit back out of the zero region, so it stays silent for the rest of training. Permanent brain damage.
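A minimal sketch of a dead unit (PyTorch; the weights are set by hand to force the pre-activation negative over the whole dataset):

```python
import torch
import torch.nn.functional as F

# A single ReLU unit whose pre-activation is negative for every input in the data:
# it outputs 0 everywhere, so its weights receive exactly zero gradient.
x = torch.rand(100, 3)                                   # inputs in [0, 1)
w = torch.tensor([-5.0, -5.0, -5.0], requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

out = F.relu(x @ w + b)                                  # all zeros: the unit is dead
out.sum().backward()

print(out.abs().max().item())   # 0.0
print(w.grad, b.grad)           # all zeros -> no update can ever revive this unit
```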
See Activation Function for the full menu.
How to diagnose
- log per-layer gradient norms during training (see the helper sketch after this list); vanishing: norms decay geometrically toward the input; exploding: norms spike or go NaN
- watch early-layer weight histograms over epochs; if they don't move, gradients aren't reaching them
- loss plateaus immediately at a high value → likely vanishing
- loss is NaN after a few steps → almost certainly exploding
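A helper along these lines, hooked into whatever training loop you already have (PyTorch; `model` and `step` are assumed to exist in your loop):

```python
import torch

def log_grad_norms(model, step):
    """Call right after loss.backward(): print per-parameter gradient norms and flag NaN/Inf."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        norm = p.grad.norm().item()
        flag = "  <-- non-finite!" if not torch.isfinite(p.grad).all() else ""
        print(f"step {step:5d}  {name:30s}  grad norm {norm:10.3e}{flag}")
```

Vanishing shows up as norms that shrink layer by layer toward the input; exploding shows up as norms that jump orders of magnitude from one step to the next.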
How modern nets solve it
Each fix targets a different factor in the product:
| Fix | What it does |
|---|---|
| Kaiming init | Scales weights so each layer's Jacobian has singular values ≈ 1 at initialization |
| ReLU / GELU / SiLU | Activation derivative is ≈ 1 (not ≪ 1) on the active region, no per-layer shrinkage |
| BatchNorm / LayerNorm | Re-centers activations each layer, preventing drift into saturated regimes |
| Residual connections ($y = x + F(x)$) | Gradient flows through the identity path: $\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}$, so the worst case is "gradient passes through unchanged" instead of shrinking |
| LSTM gating | Cell state has an additive path, forget gate keeps gradient alive across timesteps |
| Gradient clipping | Bandage for exploding gradients only, caps at a threshold, doesnβt help vanishing |
The pattern
Prevent geometric decay/growth at the source (init, normalization, activation choice, residual paths). Clipping is the only βcleanup at the endβ fix and only helps the exploding side.
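A sketch combining several of those fixes in one toy block (PyTorch; the width, depth, and learning rate are arbitrary choices, not a recommendation):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Identity path + LayerNorm + Kaiming-initialized ReLU layer: several of the table's fixes at once."""
    def __init__(self, width):
        super().__init__()
        self.norm = nn.LayerNorm(width)
        self.fc = nn.Linear(width, width)
        nn.init.kaiming_normal_(self.fc.weight, nonlinearity="relu")  # Jacobian singular values ~ 1 at init
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        return x + torch.relu(self.fc(self.norm(x)))  # gradient always has the identity path

model = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 64), torch.randn(16, 64)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the "bandage" for the exploding side
opt.step()
```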
Related
- Backpropagation
- Kaiming Initialization
- Batch Normalization
- Activation Function
- Gradient Clipping
- LSTM
- RNN