Attention (Transformer)
Attention mixes information by using query-key similarity to take a weighted sum of values.
Terms:
- Self-Attention
- Cross-Attention
- Multi-Head Attention
- Masked Attention
- Sparse Attention
- FlashAttention
- PagedAttention (manages the KV cache in fixed-size blocks to cut memory waste during serving)
- Multi-Head Latent Attention
- Multi-Query Attention (modern)
- Grouped Query Attention
Confusion:
Intuition
Soft dictionary lookup. Each query asks "who's relevant to me?", each key advertises "here's what I'm about," and each value carries the actual content. The softmax over the query-key scores picks a weighted blend of values.
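The lookup picture can be sketched for a single query in plain Python. This is only an illustration: the toy keys, values, and the helper name `soft_lookup` are made up for the example, and no scaling or learned projections are applied yet.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    # Similarity of the query to every key (plain dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Turn similarities into weights that sum to 1.
    weights = softmax(scores)
    # Weighted blend of the value vectors, one output component at a time.
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy data (assumed for illustration): three keys, each with its own value.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # matches the first key best, the third partially

weights, blended = soft_lookup(query, keys, values)
print([round(w, 3) for w in weights])
print([round(b, 3) for b in blended])
```

Unlike a hard dictionary lookup, every value contributes a little; the best-matching key just contributes the most.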
Shapes
- Query (Q): what this position is looking for
- Key (K): what each position advertises
- Value (V): what each position contributes if selected
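Input: $X \in \mathbb{R}^{n \times d_{\text{model}}}$ ($n$ tokens, one embedding per row).
Project into queries, keys, values: $Q = X W_Q$, $K = X W_K$, $V = X W_V$, with $W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$.
So: $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$.
Scores: $S = Q K^\top / \sqrt{d_k} \in \mathbb{R}^{n \times n}$.
Apply softmax row-wise over the last dimension: $A = \mathrm{softmax}(S)$, so each row of $A$ sums to 1.
Then mix values: $O = A V \in \mathbb{R}^{n \times d_v}$.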
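The steps above (project, score, softmax, mix) can be sketched end to end. This is a minimal single-head sketch in dependency-free Python, not a production implementation: the sizes `n`, `d_model`, `d_k`, `d_v`, the random weights, and all helper names are assumptions chosen for illustration. A real model would use a tensor library and learned weights.

```python
import math
import random

def matmul(a, b):
    # (n x m) @ (m x p) -> (n x p), using plain nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(scores):
    # Softmax applied independently to each row (the last dimension).
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    q = matmul(x, w_q)  # (n, d_k)
    k = matmul(x, w_k)  # (n, d_k)
    v = matmul(x, w_v)  # (n, d_v)
    d_k = len(q[0])
    raw = matmul(q, transpose(k))  # (n, n) query-key dot products
    scores = [[s / math.sqrt(d_k) for s in row] for row in raw]
    weights = softmax_rows(scores)  # (n, n), each row sums to 1
    return weights, matmul(weights, v)  # output is (n, d_v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_k, d_v = 4, 8, 3, 5  # illustrative sizes
x = random_matrix(n, d_model, rng)
w_q = random_matrix(d_model, d_k, rng)
w_k = random_matrix(d_model, d_k, rng)
w_v = random_matrix(d_model, d_v, rng)

weights, out = self_attention(x, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # attention matrix shape: 4 4
print(len(out), len(out[0]))          # output shape: 4 5
```

Row i of `weights` says how much token i draws from every token, and row i of `out` is the resulting blend of value vectors.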
Are $W_Q$, $W_K$, and $W_V$ really necessary? Why not use the raw embedding?
Yes. The projections let the model learn separate subspaces for matching and for passing content. Otherwise the same representation would have to serve all three roles.
Why not make queries and keys the same matrix?
Because "what I'm looking for" and "what I offer" are different roles. Dating analogy: the traits you want in a partner are not the same as the traits you advertise about yourself. Separate $W_Q$ and $W_K$ let attention model that asymmetry; tying them would make matching less expressive.
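One concrete way to see the cost of tying: if the query and key projections are the same matrix, the pre-softmax score matrix is symmetric, so token i scores token j exactly as j scores i. A small check in plain Python follows; the sizes, random weights, and helper names are illustrative assumptions.

```python
import random

def matmul(a, b):
    # (n x m) @ (m x p) -> (n x p), using plain nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def is_symmetric(m, tol=1e-9):
    size = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(size) for j in range(size))

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_k = 4, 6, 3  # illustrative sizes
x = random_matrix(n, d_model, rng)
w_q = random_matrix(d_model, d_k, rng)
w_k = random_matrix(d_model, d_k, rng)

q = matmul(x, w_q)
k = matmul(x, w_k)

tied_scores = matmul(q, transpose(q))    # keys reuse the query projection
untied_scores = matmul(q, transpose(k))  # separate projections

print("tied scores symmetric:  ", is_symmetric(tied_scores))
print("untied scores symmetric:", is_symmetric(untied_scores))
```

The symmetry applies to the raw scores; the row-wise softmax can still produce different weights in each direction, but the logits it starts from are forced to agree.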
Why $\sqrt{d_k}$?
Without it, dot products grow with dimension ($q \cdot k$ has variance proportional to $d_k$ for random vectors whose components have roughly unit variance). Then softmax saturates into a near one-hot and gradients vanish.
Dividing by $\sqrt{d_k}$ keeps logit variance around 1, so softmax stays in the informative regime where it can still be nudged during training.
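A quick simulation makes the scaling argument tangible. The sketch below assumes query and key components are independent standard normal draws (a simplification, not something a trained model guarantees); the trial counts and helper names are also illustrative.

```python
import math
import random
import statistics

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def random_vector(d, rng):
    # Independent components with mean 0 and variance 1 (assumed).
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logit_variances(d_k, trials, rng):
    # Empirical variance of q.k before and after dividing by sqrt(d_k).
    dots = [dot(random_vector(d_k, rng), random_vector(d_k, rng)) for _ in range(trials)]
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([d / math.sqrt(d_k) for d in dots])
    return raw, scaled

def mean_peak_weight(d_k, n_keys, trials, rng):
    # Average of the largest softmax weight, with and without scaling.
    raw_peak, scaled_peak = 0.0, 0.0
    for _ in range(trials):
        q = random_vector(d_k, rng)
        logits = [dot(q, random_vector(d_k, rng)) for _ in range(n_keys)]
        raw_peak += max(softmax(logits))
        scaled_peak += max(softmax([s / math.sqrt(d_k) for s in logits]))
    return raw_peak / trials, scaled_peak / trials

rng = random.Random(0)
results = {}
for d_k in (4, 64, 256):
    results[d_k] = logit_variances(d_k, 1000, rng)
    raw_var, scaled_var = results[d_k]
    print(f"d_k={d_k:3d}  raw variance={raw_var:7.1f}  scaled variance={scaled_var:.2f}")

raw_peak, scaled_peak = mean_peak_weight(256, 8, 100, rng)
print(f"mean largest softmax weight at d_k=256: raw={raw_peak:.2f}, scaled={scaled_peak:.2f}")
```

The raw variance should track $d_k$ while the scaled variance stays near 1, and the unscaled softmax should put nearly all of its weight on a single key.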
I was really confused about QKV:
- https://www.reddit.com/r/MachineLearning/comments/19ewfm9/d_attention_mystery_which_is_which_q_k_or_v/
- https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms