Attention

Bahdanau Attention

The original Seq2Seq encoder-decoder compressed the full source sequence into one fixed-length vector $c$. Bahdanau attention removes that bottleneck by letting the decoder look back at all encoder states at each step.

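For decoder timestep $t$, with previous decoder state $s_{t-1}$ and encoder states $h_1, \dots, h_T$:

  1. Score each encoder state: $e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$
  2. Normalize: $\alpha_{t,i} = \exp(e_{t,i}) / \sum_{j=1}^{T} \exp(e_{t,j})$
  3. Build context: $c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$
  4. Decode using $c_t$: $s_t = f(s_{t-1}, y_{t-1}, c_t)$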
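
The first three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the common parameterization in which the score is a one-hidden-layer MLP over the previous decoder state and one encoder state. The names `W_a`, `U_a`, `v_a`, the dimensions, and the random inputs are all placeholders chosen for the example. Step 4 (the decoder update itself) is left out because it depends on the RNN cell being used.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau-style) attention.

    s_prev: (d_s,)     previous decoder state
    H:      (T, d_h)   encoder states, one row per source position
    W_a:    (d_a, d_s) projection of the decoder state (assumed name)
    U_a:    (d_a, d_h) projection of the encoder states (assumed name)
    v_a:    (d_a,)     vector that maps the hidden layer to a scalar score
    """
    # 1. Score each encoder state with a small MLP: v_a^T tanh(W_a s + U_a h_i).
    scores = np.tanh(H @ U_a.T + W_a @ s_prev) @ v_a  # shape (T,)
    # 2. Normalize with a numerically stable softmax.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()           # shape (T,)
    # 3. Build the context vector as a weighted sum of encoder states.
    context = weights @ H                             # shape (d_h,)
    return context, weights

# Toy usage with random values, only to show the shapes involved.
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

context, weights = additive_attention(s_prev, H, W_a, U_a, v_a)
```

The weights form a probability distribution over the `T` source positions, and the context vector has the same dimensionality as a single encoder state. A decoder would consume `context` alongside its previous state and previous output token.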

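This is usually called additive attention because the score comes from a learned MLP instead of a dot product. Modern Transformer attention keeps the same score, normalize, and weighted-sum recipe, but swaps the scoring function for scaled dot products and cleanly separates keys from values: in Bahdanau attention each encoder state is both the thing that gets scored and the thing that gets summed, whereas a Transformer uses separate key and value projections for those two roles.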
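
For comparison, here is a minimal NumPy sketch of scaled dot-product attention. It assumes the queries, keys, and values have already been projected, and it omits masking and multiple heads. The function name, shapes, and random inputs are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention without masking.

    Q: (n_q, d_k) queries
    K: (n_k, d_k) keys
    V: (n_k, d_v) values
    """
    d_k = Q.shape[-1]
    # Score every query against every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k)
    # Row-wise, numerically stable softmax over the keys.
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights

# Toy usage with random values, only to show the shapes involved.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 3))

output, attn = scaled_dot_product_attention(Q, K, V)
```

The structure mirrors the additive version: a score per query-key pair, a softmax over source positions, and a weighted sum. The differences are that the score is a dot product rather than an MLP, and the vectors being summed (`V`) are distinct from the vectors being scored (`K`).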