Multi-Head Attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
In the original paper, they employ h = 8 parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{\text{model}}/h = 64$.
Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
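To make the shapes concrete, here is a minimal NumPy sketch of the formula above. The weights are random placeholders rather than learned parameters, there is no batching or masking, and the helper names and the 0.02 init scale are illustrative assumptions, not part of the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, h=8, seed=0):
    """Project X into h low-dimensional heads, attend in each, concat, project back."""
    d_model = X.shape[-1]
    d_k = d_model // h                    # d_k = d_v = d_model / h (64 when d_model = 512, h = 8)
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        W_Q = rng.normal(0, 0.02, (d_model, d_k))   # W_i^Q
        W_K = rng.normal(0, 0.02, (d_model, d_k))   # W_i^K
        W_V = rng.normal(0, 0.02, (d_model, d_k))   # W_i^V (d_v = d_k here)
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.normal(0, 0.02, (h * d_k, d_model))   # W^O maps the concatenation back to d_model
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.default_rng(1).normal(size=(10, 512))  # 10 positions, d_model = 512
print(multi_head_attention(X).shape)                  # (10, 512)
```

Each head attends over vectors of size 64 rather than 512, which is why the total cost stays close to that of one full-dimensional head.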
I don't understand: don't the heads just end up doing the same thing?
We want each head to attend to different aspects of the same word.
Intuitively, it could happen. But here's why it usually doesn't:
- Each head has its own $W_i^Q$, $W_i^K$, $W_i^V$ matrices, all initialized differently (see the sketch after this list).
- During training, if two heads start doing the same thing, they don't both get rewarded equally: gradients nudge them to specialize and reduce redundancy.
- Why? Because doing the same thing doesn't reduce the loss as effectively as learning different, complementary patterns.
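As a toy illustration of the first point (untrained random weights, nothing learned, names are made up for this example): two heads with identically shaped but differently initialized projections already produce different attention maps over the same input, and training then pushes them further apart when specializing lowers the loss more than duplication does.

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Attention weight matrix softmax(Q K^T / sqrt(d_k)) for one head."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k, n = 512, 64, 6
X = np.random.default_rng(0).normal(size=(n, d_model))   # one toy sequence of 6 positions

# Two heads: same shapes, different random initializations.
rng1, rng2 = np.random.default_rng(1), np.random.default_rng(2)
A1 = attention_weights(X, rng1.normal(0, 0.02, (d_model, d_k)),
                          rng1.normal(0, 0.02, (d_model, d_k)))
A2 = attention_weights(X, rng2.normal(0, 0.02, (d_model, d_k)),
                          rng2.normal(0, 0.02, (d_model, d_k)))

# Even before any training, the two heads attend differently to the same input.
print(np.abs(A1 - A2).mean())   # > 0: the attention patterns are not identical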