Attention (Transformer)

Terms:

Self-Attention
Masked Attention
Sparse Attention
Flash Attention
Paged Attention (stores the KV cache in fixed-size blocks to reduce memory waste when serving)
Multi-Head Latent Attention
Multi-Head Attention
Multi-Query Attention (modern)
Grouped Query Attention

Confusion:

https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms

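$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$

In matrix form, writing $\alpha_{ij}$ for the softmax weight of query $i$ on key $j$ (normalized over $j$):

$\mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V = \begin{bmatrix} \alpha_{11} v_{1}^{T} + \dots + \alpha_{1n} v_{n}^{T} \\ \vdots \\ \alpha_{m1} v_{1}^{T} + \dots + \alpha_{mn} v_{n}^{T} \end{bmatrix} \in \mathbb{R}^{m \times d}, \quad \alpha_{ij} = \frac{\exp(\langle q_{i}, k_{j} \rangle / \sqrt{d_{k}})}{\sum_{l=1}^{n} \exp(\langle q_{i}, k_{l} \rangle / \sqrt{d_{k}})}$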
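The formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax_rows` and `attention`, the sizes, and the random inputs are all assumptions made for the example, and it uses the usual $\sqrt{d_{k}}$ scaling.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (m, n): one score per query-key pair
    weights = softmax_rows(scores)    # (m, n): each row sums to 1
    return weights @ V, weights       # (m, d) output and the weights

rng = np.random.default_rng(0)
m, n, d = 3, 5, 4                     # illustrative sizes only
Q = rng.normal(size=(m, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

out, weights = attention(Q, K, V)
print(out.shape)                      # (3, 4)
print(weights.sum(axis=1))            # every entry is 1 up to rounding
```

Row $i$ of `out` is the weighted sum $\sum_{j} \alpha_{ij} v_{j}^{T}$, with the weights taken from row $i$ of `weights`.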

Attention mechanism

Query, roughly speaking, is what I am looking for.
Key is what I represent.
Value is what I actually contain.
Query (Q): “What am I looking for?”
Key (K): “What do I have?”
Value (V): “What do I give you if you choose me?”

https://www.youtube.com/watch?v=bCz4OMemCcA

Another vid

https://www.youtube.com/watch?v=QCJQG4DuHT0

I was really confused about QKV. This thread helped:

https://www.reddit.com/r/MachineLearning/comments/19ewfm9/d_attention_mystery_which_is_which_q_k_or_v/

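Conceptually, no physical property distinguishes Q from K. For all we know, Q is K and K is Q. The names only matter through the roles they play in the formula: the softmax normalizes over the keys, and each output row belongs to one query.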

Think of a library:

Query = what you want to read about
Keys = the summary on the card catalog for each book
Values = the full content of each book
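The library analogy can be read as a "soft" dictionary lookup. The toy example below is only a sketch: the one-hot keys, the made-up value vectors, and the query are invented for illustration, and the $\sqrt{d_{k}}$ scaling is left out to keep the numbers easy to read.

```python
import numpy as np

# Hypothetical card catalog: each key is a one-hot "topic" vector.
keys = np.eye(3)                        # 3 books, 3 topics
values = np.array([[10.0, 0.0],
                   [0.0, 20.0],
                   [5.0, 5.0]])        # made-up "contents" of each book
query = np.array([0.0, 8.0, 0.0])       # strongly asks for topic 1

scores = keys @ query                   # how well the query matches each card
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax over the books
result = weights @ values               # weighted mix of the contents

print(weights.round(3))                 # almost all weight on book 1
print(result.round(2))                  # very close to values[1]
```

Unlike a real catalog lookup, nothing is selected outright: every book contributes, just with a weight that depends on how well its key matches the query.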

Question

In attention, are $W_{Q}$, $W_{K}$ and $W_{V}$ really necessary? Why not just use the raw embeddings as Q, K and V? The embeddings are learned and will change anyway.

$Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V}$
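One way to probe the question is to compare the score matrix with and without projections. The sketch below uses random matrices as stand-ins for learned weights (an assumption made for illustration; the sizes are arbitrary too). With $Q = K = X$ the scores $X X^{T}$ are always symmetric, so token $i$ scores token $j$ exactly as $j$ scores $i$. Separate $W_{Q}$ and $W_{K}$ remove that constraint and also let $d_{k}$ differ from the embedding size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3                 # illustrative sizes only
X = rng.normal(size=(n, d_model))         # one row per token embedding

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

raw_scores = X @ X.T                      # no projections: Q = K = X
proj_scores = Q @ K.T                     # X W_Q W_K^T X^T

print(np.allclose(raw_scores, raw_scores.T))     # True: always symmetric
print(np.allclose(proj_scores, proj_scores.T))   # False for these random weights
print(Q.shape, K.shape, V.shape)                 # (4, 3) (4, 3) (4, 3)
```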

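Inputs:
- Query $Q \in \mathbb{R}^{m \times d}$
- Key $K \in \mathbb{R}^{n \times d}$
- Value $V \in \mathbb{R}^{n \times d}$

Each matrix stacks its vectors as rows:

$Q = \begin{bmatrix} q_{1}^{T} \\ \vdots \\ q_{m}^{T} \end{bmatrix} \in \mathbb{R}^{m \times d}, \quad K = \begin{bmatrix} k_{1}^{T} \\ \vdots \\ k_{n}^{T} \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad V = \begin{bmatrix} v_{1}^{T} \\ \vdots \\ v_{n}^{T} \end{bmatrix} \in \mathbb{R}^{n \times d}$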

Output: an $m \times d$ matrix

The dot product of the query and key tells you how well the key and query are aligned.

Then the softmax is applied row-wise, so the weights of each query over the $n$ keys sum to 1. For a row $z$ of $n$ scores, $\mathrm{softmax}(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{n} e^{z_{j}}}$.
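A row-wise softmax might look like the sketch below. The helper name `softmax_rows` and the example scores are assumptions. Subtracting each row's maximum before exponentiating is a common stability trick: it leaves the result unchanged but avoids overflow for large scores.

```python
import numpy as np

def softmax_rows(z):
    """Apply softmax independently to each row of a 2-D array."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=1, keepdims=True)   # same result, no overflow
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])    # second row would overflow naively
probs = softmax_rows(scores)

print(probs.sum(axis=1))   # each row sums to 1
print(probs[1])            # uniform, since the three scores are equal
```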

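What is $d_{k}$? It is the dimension of the key (and query) vectors.

The division is just a scaling factor. Dot products of $d_{k}$-dimensional vectors tend to grow in magnitude with $d_{k}$, and large scores push the softmax into regions with very small gradients, which is why the original Transformer paper divides by $\sqrt{d_{k}}$.
I asked the professor and he said that, empirically, it gives the best performance.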
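A quick numerical check of the scaling argument, as a sketch under a simplifying assumption: the components of the queries and keys are independent with mean 0 and variance 1 (real activations need not be). Under that assumption the dot product $\langle q, k \rangle$ has variance $d_{k}$, and dividing by $\sqrt{d_{k}}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
variances = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(4000, d_k))      # entries with mean 0, variance 1
    k = rng.normal(size=(4000, d_k))
    dots = (q * k).sum(axis=1)            # 4000 sample dot products
    raw = dots.var()
    scaled = (dots / np.sqrt(d_k)).var()
    variances[d_k] = (raw, scaled)
    print(f"d_k={d_k:5d}  raw var={raw:8.1f}  scaled var={scaled:.2f}")
```

The raw variance should come out close to $d_{k}$ in each case, while the scaled variance stays near 1 regardless of $d_{k}$.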

Attention Layer

This is cross-attention
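In cross-attention the queries come from one sequence and the keys and values from another, which is why $m$ and $n$ can differ. The sketch below contrasts it with self-attention. The names `X_tgt` and `X_src`, the sizes, and the random stand-in weights are assumptions made for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                       # illustrative sizes only
X_tgt = rng.normal(size=(3, d_model))     # m = 3 positions that issue queries
X_src = rng.normal(size=(6, d_model))     # n = 6 positions that are attended to
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Cross-attention: Q from one sequence, K and V from the other.
cross = attention(X_tgt @ W_Q, X_src @ W_K, X_src @ W_V)
# Self-attention: Q, K and V all come from the same sequence.
self_out = attention(X_src @ W_Q, X_src @ W_K, X_src @ W_V)

print(cross.shape, self_out.shape)        # (3, 4) (6, 4)
```

The output always has one row per query, so cross-attention returns $m$ rows even though it attends over $n$ source positions.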

KV Cache
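The idea, roughly: during autoregressive decoding the keys and values of earlier tokens do not change, so they can be stored and reused instead of being recomputed at every step. The sketch below is a toy single-head version with random stand-in weights and embeddings (all names and sizes are assumptions). It checks that decoding with a cache gives the same outputs as recomputing K and V for the whole prefix each time.

```python
import numpy as np

def attend(q, K, V):
    """Attention output for a single query vector q over keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d_model, d_k, steps = 8, 4, 5             # illustrative sizes only
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
X = rng.normal(size=(steps, d_model))     # stand-in embedding for each decoded token

K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_out = []
for x in X:                               # one decoding step per token
    K_cache = np.vstack([K_cache, x @ W_K])   # project only the new token
    V_cache = np.vstack([V_cache, x @ W_V])
    cached_out.append(attend(x @ W_Q, K_cache, V_cache))
cached_out = np.array(cached_out)

# Reference: recompute K and V for the whole prefix at every step.
full_out = np.array([
    attend(X[t] @ W_Q, X[: t + 1] @ W_K, X[: t + 1] @ W_V)
    for t in range(steps)
])
print(np.allclose(cached_out, full_out))  # True
```

The cache trades memory for compute: each step projects one new token instead of the whole prefix, but the stored keys and values grow with the sequence length.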

🛠️ Steven Gong
