Gated Recurrent Unit (GRU)

Simpler alternative to the LSTM (Cho et al. 2014). Merges the input and forget gates into a single update gate, and drops the separate cell state, so only the hidden state propagates.

### Equations (CS231n 2024 Lec 7)

$$\begin{aligned}
r_t &= \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \quad &\text{(reset gate)} \\
z_t &= \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \quad &\text{(update gate)} \\
\tilde{h}_t &= \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}) + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}$$

The convex combination $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ is the key gradient-flow path: when $z_t \approx 1$, the old state passes through unchanged (like an LSTM with its forget gate open). Fewer parameters than an LSTM, with comparable performance on many tasks.

### Source

CS231n 2024 Lec 7 slide 122 (GRU equations from Cho et al. 2014, shown alongside LSTM variants MUT1/MUT2/MUT3 from Jozefowicz et al. 2015).

### Related

- [[notes/Long Short-Term Memory|LSTM]]
- [[notes/Recurrent Neural Network|RNN]]
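### Code sketch

A minimal NumPy sketch of one GRU step under the equations above, using this note's convention where $z_t$ gates the old state and $1 - z_t$ gates the candidate. The function name `gru_step`, the parameter names (`W_xr`, `b_r`, ...), and the toy sizes are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step (illustrative; z_t gates the old state, 1 - z_t the candidate)."""
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])            # reset gate
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])            # update gate
    h_tilde = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev) + p["b_h"])  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde                                # convex combination

# Hypothetical sizes, just to exercise the step.
d_x, d_h = 8, 16
rng = np.random.default_rng(0)
params = {name: rng.standard_normal(shape) * 0.1
          for name, shape in [
              ("W_xr", (d_h, d_x)), ("W_hr", (d_h, d_h)), ("b_r", (d_h,)),
              ("W_xz", (d_h, d_x)), ("W_hz", (d_h, d_h)), ("b_z", (d_h,)),
              ("W_xh", (d_h, d_x)), ("W_hh", (d_h, d_h)), ("b_h", (d_h,)),
          ]}
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_x)):  # unroll over a length-5 sequence
    h = gru_step(x, h, params)
```

One layer needs 3 input-to-hidden matrices, 3 hidden-to-hidden matrices, and 3 biases, versus 4 of each for an LSTM, which is the "fewer parameters" point above.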