Bahdanau Attention

Seq2Seq originally compressed the full source sequence into one vector $c = h_{T}$ . Bahdanau attention replaces that bottleneck by letting the decoder look back at all encoder states at each step.

For decoder timestep $t$ :

Score each encoder state: $e_{t, i} = f_{a tt} (s_{t - 1}, h_{i})$
Normalize: $a_{t, *} = softmax (e_{t, *})$
Build context: $c_{t} = \sum_{i} a_{t, i} h_{i}$
Decode using $c_{t}$

This is usually called additive attention because the score comes from a learned MLP instead of a dot product. Modern Transformer attention keeps the same idea, but swaps the scoring function for scaled dot products and cleanly separates keys from values.

🛠️ Steven Gong

Bahdanau Attention

Graph View

Backlinks