Long Short-Term Memory (LSTM)
LSTM (Hochreiter & Schmidhuber, 1997) is the standard fix for the vanilla RNN's vanishing-gradient problem. It adds a cell state $c_t$ alongside the hidden state $h_t$, with multiplicative gates that control reads, writes, and erases. See Colah's legendary explainer.
Why?
Vanilla RNN backprop multiplies by $W_{hh}^T$ at every timestep, so gradients vanish or explode geometrically. LSTM gives the cell state an additive update path with no per-step matrix multiply: $c_t = f_t \odot c_{t-1} + i_t \odot g_t$. Backprop from $c_t$ to $c_{t-1}$ is just elementwise multiplication by $f_t$. If $f_t \approx 1$, gradient flows uninterrupted across many timesteps, the same trick as ResNet's identity skip.
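A hypothetical numeric sketch (mine, not from the slides) of the vanilla-RNN half of this claim: repeated multiplication by $W_{hh}^T$ shrinks the gradient geometrically when the spectral norm of $W_{hh}$ is below 1.

```python
# Sketch: backprop through T timesteps of a vanilla RNN multiplies the
# gradient by W_hh^T each step, so its norm decays (or blows up) geometrically.
import numpy as np

rng = np.random.default_rng(0)
h, T = 16, 50
W_hh = rng.normal(scale=0.1, size=(h, h))   # spectral norm well below 1

grad = np.ones(h)
norms = []
for _ in range(T):
    grad = W_hh.T @ grad                    # one backprop step through time
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])                  # geometric collapse toward 0
```

With a weight scale large enough to push the spectral norm above 1, the same loop explodes instead; either way, the vanilla chain is geometric in $T$.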
Equations (CS231n 2024 Lec 7)
Stack the previous hidden state $h_{t-1}$ and current input $x_t$ into one vector of size $2h$, multiply by one big weight matrix of shape $4h \times 2h$, then split the $4h$-dim output into four chunks and pass each through its own nonlinearity:
| gate | nonlinearity | role |
|---|---|---|
| $i$ (input gate) | sigmoid | whether to write to cell |
| $f$ (forget gate) | sigmoid | whether to erase cell |
| $o$ (output gate) | sigmoid | how much of cell to reveal as $h_t$ |
| $g$ (gate gate) | tanh | how much to write to cell |
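The stacked-matrix formulation above can be sketched in numpy as one timestep (a minimal sketch; the function name and the assumption that input dim equals hidden dim are mine, chosen to match the $4h \times 2h$ shape):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. W has shape (4h, 2h): one big matrix whose
    output is split into the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b      # shape (4h,)
    i, f, o, g = np.split(z, 4)                  # four h-dim chunks
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates in (0, 1)
    g = np.tanh(g)                                # candidate write in (-1, 1)
    c = f * c_prev + i * g                        # additive cell update
    h_new = o * np.tanh(c)                        # reveal gated cell
    return h_new, c

rng = np.random.default_rng(0)
h_dim = 8
x = rng.normal(size=h_dim)        # input dim == hidden dim here
h0, c0 = np.zeros(h_dim), np.zeros(h_dim)
W = rng.normal(scale=0.1, size=(4 * h_dim, 2 * h_dim))
b = np.zeros(4 * h_dim)
h1, c1 = lstm_step(x, h0, c0, W, b)
print(h1.shape, c1.shape)         # (8,) (8,)
```

In general the input dimension need not equal $h$; then the big matrix is $4h \times (h + d)$ for input dim $d$.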
Why gradient flow works
Backprop from $c_t$ to $c_{t-1}$ through the cell state is just elementwise multiplication by $f_t$, with no matrix multiply by $W$. Across the whole unrolled chain, $\partial c_T / \partial c_0 = \prod_{t=1}^{T} f_t$ (elementwise). If the forget gate stays near 1, gradient is preserved.
This doesn't guarantee no vanishing/exploding, but it makes long-distance dependencies much easier to learn than in a vanilla RNN. The structural analogue to ResNet (an additive skip path that bypasses the nonlinear transformation) is not a coincidence.
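A hypothetical sketch (mine) of the cell-path claim: the gradient along the cell state is an elementwise product of forget gates, so with $f_t$ held near 1 it survives 50 steps instead of collapsing.

```python
# Sketch: dc_T/dc_0 along the LSTM cell path is the elementwise product
# of forget gates; no matrix multiply appears anywhere in the chain.
import numpy as np

T, h = 50, 16
f = np.full((T, h), 0.99)          # forget gates held near 1

grad = np.ones(h)                  # upstream gradient at c_T
for t in range(T):
    grad = f[t] * grad             # elementwise, one step back in time

print(grad[0])                     # 0.99**50 ≈ 0.605 per unit, not ~0
```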
Highway Networks
In between vanilla RNN and LSTM lies the Highway Network (Srivastava et al., ICML 2015): $y = T(x) \odot H(x) + (1 - T(x)) \odot x$, where $T(x)$ is a learned sigmoid gate and $H(x)$ a learned transform. Same gating idea, applied to feedforward depth.
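A minimal sketch of one highway layer under that equation (numpy; function and variable names are mine). The negative gate bias follows the paper's trick of initializing the layer close to the identity:

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with T a sigmoid gate."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    H = np.tanh(W_h @ x + b_h)         # transform path
    T = sigmoid(W_t @ x + b_t)         # learned gate in (0, 1)
    return T * H + (1.0 - T) * x       # gated mix of transform and identity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W_h = rng.normal(scale=0.1, size=(d, d))
W_t = rng.normal(scale=0.1, size=(d, d))
b_h = np.zeros(d)
b_t = np.full(d, -2.0)                 # negative gate bias: start near identity
y = highway_layer(x, W_h, b_h, W_t, b_t)
print(np.linalg.norm(y - x))           # small: layer initially ≈ identity
```

With $T(x) \approx 0$ at init, gradients pass through the identity term untouched, which is exactly the carry behavior the forget gate provides in the LSTM cell path.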
Source
CS231n 2024 Lec 7 slides 96, 111–121 (LSTM equations, gate roles, gradient flow, ResNet analogy, Highway Networks).