Long Short-Term Memory (LSTM)
LSTM (Hochreiter & Schmidhuber, 1997) is the standard fix for the vanilla RNN's vanishing-gradient problem. It adds a cell state $c_t$ alongside the hidden state $h_t$, with multiplicative gates that control reads, writes, and erases. See Colah's legendary explainer.
Why?
Vanilla RNN backprop multiplies by $W_{hh}^T$ at every timestep, so gradients vanish or explode geometrically. LSTM gives the cell state an additive update path with no per-step matrix multiply: $c_t = f_t \odot c_{t-1} + i_t \odot g_t$. Backprop from $c_t$ to $c_{t-1}$ is just elementwise multiplication by $f_t$. If $f_t \approx 1$, gradient flows uninterrupted across many timesteps, the same trick as ResNet's identity skip.
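A hypothetical numeric sketch (mine, not from the slides) of the vanilla-RNN half of this claim: repeated multiplication by $W_{hh}^T$ shrinks the gradient geometrically when the spectral norm of $W_{hh}$ is below 1.

```python
# Sketch: backprop through T timesteps of a vanilla RNN multiplies the
# gradient by W_hh^T each step, so its norm decays (or blows up) geometrically.
import numpy as np

rng = np.random.default_rng(0)
h, T = 16, 50
W_hh = rng.normal(scale=0.1, size=(h, h))   # spectral norm well below 1

grad = np.ones(h)
norms = []
for _ in range(T):
    grad = W_hh.T @ grad                    # one backprop step through time
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])                  # geometric collapse toward 0
```

With a weight scale large enough to push the spectral norm above 1, the same loop explodes instead; either way, the vanilla chain is geometric in $T$.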
Equations (CS231n 2024 Lec 7)
Stack the previous hidden state $h_{t-1}$ and current input $x_t$ into one vector of size $2h$, multiply by one big weight matrix of shape $4h \times 2h$, then split the $4h$-dim output into four chunks and pass each through its own nonlinearity:
| gate | nonlinearity | role |
|---|---|---|
| $i$ (input gate) | sigmoid | whether to write to cell |
| $f$ (forget gate) | sigmoid | whether to erase cell |
| $o$ (output gate) | sigmoid | how much of cell to reveal as $h_t$ |
| $g$ (gate gate) | tanh | how much to write to cell |
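The stacked-matrix formulation above can be sketched in numpy as one timestep (a minimal sketch; the function name and the assumption that input dim equals hidden dim are mine, chosen to match the $4h \times 2h$ shape):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. W has shape (4h, 2h): one big matrix whose
    output is split into the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b      # shape (4h,)
    i, f, o, g = np.split(z, 4)                  # four h-dim chunks
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates in (0, 1)
    g = np.tanh(g)                                # candidate write in (-1, 1)
    c = f * c_prev + i * g                        # additive cell update
    h_new = o * np.tanh(c)                        # reveal gated cell
    return h_new, c

rng = np.random.default_rng(0)
h_dim = 8
x = rng.normal(size=h_dim)        # input dim == hidden dim here
h0, c0 = np.zeros(h_dim), np.zeros(h_dim)
W = rng.normal(scale=0.1, size=(4 * h_dim, 2 * h_dim))
b = np.zeros(4 * h_dim)
h1, c1 = lstm_step(x, h0, c0, W, b)
print(h1.shape, c1.shape)         # (8,) (8,)
```

In general the input dimension need not equal $h$; then the big matrix is $4h \times (h + d)$ for input dim $d$.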
Why gradient flow works
Backprop from $c_t$ to $c_{t-1}$ through the cell state is just elementwise multiplication by $f_t$, with no matrix multiply by $W$. Across the whole unrolled chain, $\partial c_T / \partial c_0 = \prod_{t=1}^{T} f_t$ (elementwise). If the forget gate stays near 1, gradient is preserved.
This doesn't guarantee no vanishing/exploding, but it makes long-distance dependencies much easier to learn than in a vanilla RNN. The structural analogue to ResNet (an additive skip path that bypasses the nonlinear transformation) is not a coincidence.
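A hypothetical sketch (mine) of the cell-path claim: the gradient along the cell state is an elementwise product of forget gates, so with $f_t$ held near 1 it survives 50 steps instead of collapsing.

```python
# Sketch: dc_T/dc_0 along the LSTM cell path is the elementwise product
# of forget gates; no matrix multiply appears anywhere in the chain.
import numpy as np

T, h = 50, 16
f = np.full((T, h), 0.99)          # forget gates held near 1

grad = np.ones(h)                  # upstream gradient at c_T
for t in range(T):
    grad = f[t] * grad             # elementwise, one step back in time

print(grad[0])                     # 0.99**50 ≈ 0.605 per unit, not ~0
```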
Highway Networks
In between vanilla RNN and LSTM lies the Highway Network (Srivastava et al., ICML 2015): $y = T(x) \odot H(x) + (1 - T(x)) \odot x$, where $T(x)$ is a learned sigmoid gate and $H(x)$ a learned transform. Same gating idea, applied to feedforward depth.
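A minimal sketch of one highway layer under that equation (numpy; function and variable names are mine). The negative gate bias follows the paper's trick of initializing the layer close to the identity:

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with T a sigmoid gate."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    H = np.tanh(W_h @ x + b_h)         # transform path
    T = sigmoid(W_t @ x + b_t)         # learned gate in (0, 1)
    return T * H + (1.0 - T) * x       # gated mix of transform and identity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W_h = rng.normal(scale=0.1, size=(d, d))
W_t = rng.normal(scale=0.1, size=(d, d))
b_h = np.zeros(d)
b_t = np.full(d, -2.0)                 # negative gate bias: start near identity
y = highway_layer(x, W_h, b_h, W_t, b_t)
print(np.linalg.norm(y - x))           # small: layer initially ≈ identity
```

With $T(x) \approx 0$ at init, gradients pass through the identity term untouched, which is exactly the carry behavior the forget gate provides in the LSTM cell path.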
Source
CS231n 2024 Lec 7 slides 96, 111–121 (LSTM equations, gate roles, gradient flow, ResNet analogy, Highway Networks).