Bahdanau Attention
Seq2Seq originally compressed the full source sequence into one vector . Bahdanau attention replaces that bottleneck by letting the decoder look back at all encoder states at each step.
For decoder timestep :
- Score each encoder state:
- Normalize:
- Build context:
- Decode using
This is usually called additive attention because the score comes from a learned MLP instead of a dot product. Modern Transformer attention keeps the same idea, but swaps the scoring function for scaled dot products and cleanly separates keys from values.