Recurrent Neural Network

Long Short-Term Memory (LSTM)

LSTM (Hochreiter & Schmidhuber, 1997) is the standard fix for the vanilla RNN's vanishing-gradient problem. It adds a cell state $c_t$ alongside the hidden state $h_t$, with multiplicative gates that control reads, writes, and erases. Legendary Colah explainer.

Why?

Vanilla RNN backprop multiplies by $W_{hh}^T$ (and the tanh derivative) at every timestep — gradients vanish or explode geometrically. LSTM gives the cell state an additive update path with no per-step matrix multiply: $c_t = f_t \odot c_{t-1} + i_t \odot g_t$. Backprop from $c_t$ to $c_{t-1}$ is just an elementwise multiply by $f_t$. If $f_t \approx 1$, the gradient flows uninterrupted across many timesteps — the same trick as ResNet's identity skip.
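The geometric decay is easy to see numerically. A minimal sketch (illustrative weight scale, tanh derivative omitted): push a gradient vector backward through 50 timesteps of a vanilla RNN and watch its norm collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64
# Random recurrent matrix with spectral radius well below 1 (illustrative scale)
W = rng.standard_normal((H, H)) * 0.5 / np.sqrt(H)

grad = np.ones(H)  # pretend upstream gradient at the last timestep
norms = []
for t in range(50):
    grad = W.T @ grad          # vanilla RNN backprop: one matmul per timestep
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])     # the norm shrinks geometrically with depth
```

With a spectral radius above 1 the same loop explodes instead; either way, the per-step matmul is the problem.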

Equations (CS231n 2024 Lec 7)

Stack the previous hidden state $h_{t-1}$ and the current input $x_t$, multiply by one big weight matrix $W$ of shape $4h \times 2h$, then split the $4h$-dim output into four chunks and pass each through its own nonlinearity:

$$
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix},
\qquad
c_t = f \odot c_{t-1} + i \odot g,
\qquad
h_t = o \odot \tanh(c_t)
$$

| gate | nonlinearity | role |
|---|---|---|
| $i$ — input gate | sigmoid | whether to write to cell |
| $f$ — forget gate | sigmoid | whether to erase cell |
| $o$ — output gate | sigmoid | how much of cell to reveal as $h_t$ |
| $g$ — gate gate | tanh | how much to write to cell |
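The one-big-matmul-then-split structure translates directly to code. A minimal numpy sketch of a single LSTM step (toy dimensions and random weights, not the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. W has shape (4H, 2H): one matmul, then split i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b   # (4H,) pre-activations
    i = sigmoid(z[0:H])        # input gate:  whether to write to cell
    f = sigmoid(z[H:2*H])      # forget gate: whether to erase cell
    o = sigmoid(z[2*H:3*H])    # output gate: how much of cell to reveal
    g = np.tanh(z[3*H:4*H])    # gate gate:   candidate values to write
    c = f * c_prev + i * g     # additive cell update
    h = o * np.tanh(c)         # hidden state reads the gated cell
    return h, c

rng = np.random.default_rng(0)
H = 8
W = rng.standard_normal((4 * H, 2 * H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(H), np.zeros(H), np.zeros(H), W, b)
```

In practice the forget-gate bias is often initialized positive so $f \approx 1$ at the start of training, which keeps the cell path open.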

Why gradient flow works

Backprop from $c_t$ to $c_{t-1}$ through $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ is just elementwise multiplication by $f_t$ — no matrix multiply by $W$. Across the whole unrolled chain, $\frac{\partial c_T}{\partial c_0} = \prod_{t=1}^{T} f_t$ (elementwise). If the forget gate stays near 1, the gradient is preserved.
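The product of forget gates makes the contrast concrete. A small numpy check with made-up gate values: a forget gate near 1 preserves most of the gradient over 100 steps, while 0.5 annihilates it.

```python
import numpy as np

T, H = 100, 16

# dc_T/dc_0 along the cell path is the elementwise product of forget gates
f_near_one = np.full((T, H), 0.99)   # forget gate stays near 1
f_half     = np.full((T, H), 0.5)    # forget gate at 0.5

grad_preserved = np.prod(f_near_one, axis=0)  # 0.99**100, roughly 0.37
grad_vanished  = np.prod(f_half, axis=0)      # 0.5**100, effectively zero
print(grad_preserved[0], grad_vanished[0])
```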

This doesn't guarantee no vanishing/exploding, but it makes long-distance dependencies much easier to learn than in a vanilla RNN. The structural analogue to ResNet — an additive skip path that bypasses the nonlinear transformation — is not a coincidence.

Highway Networks

In between the vanilla RNN and the LSTM lies the Highway Network (Srivastava et al., ICML 2015): $y = T(x) \odot H(x) + (1 - T(x)) \odot x$, where $T(x)$ is a learned gate. Same gating idea, applied to feedforward depth.
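A minimal sketch of one highway layer, assuming tanh for the transform $H(x)$ and a sigmoid gate $T(x)$ (dimensions and init are illustrative). Biasing the gate strongly negative makes the layer start as an identity, which is the usual trick for training very deep stacks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x, all elementwise."""
    Hx = np.tanh(W_H @ x + b_H)       # transformed path H(x)
    Tx = sigmoid(W_T @ x + b_T)       # learned gate T(x) in (0, 1)
    return Tx * Hx + (1.0 - Tx) * x   # gate interpolates transform vs. identity

rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal(D)
# Strongly negative gate bias => T(x) near 0 => layer is nearly the identity
y = highway_layer(x,
                  rng.standard_normal((D, D)), np.zeros(D),
                  rng.standard_normal((D, D)) * 0.1, np.full(D, -10.0))
print(np.max(np.abs(y - x)))  # tiny: x passes through almost unchanged
```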

Source

CS231n 2024 Lec 7 slides 96, 111–121 (LSTM equations, gate roles, gradient flow, ResNet analogy, Highway Networks).