Transformer

The Transformer is an attention-based architecture that processes sequential data in parallel, replacing recurrence with self-attention over tokens.

Why attention over RNN?

RNNs process tokens sequentially, limiting parallelism. Transformers let every token see every other in one matmul.

Architecturally, there’s nothing about the Transformer that prevents it from scaling to longer sequences. The limit is that attention produces an L × L matrix, which becomes huge.
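A quick back-of-the-envelope (my own numbers, just for scale): at L = 32,768 the score matrix has L² ≈ 1.1 × 10⁹ entries per head per example; stored in fp16 (2 bytes each) that’s roughly 2 GB, before multiplying by the number of heads.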

Intuition

A Transformer block does exactly two things: mix tokens (self-attention) and think per token (MLP). Attention decides who talks to whom; the MLP then does the actual per-token computation. Stack this, and you get a model where any token can reach any other in one hop.

A rough draft of the dimensions (inspo from Shape Suffixes)

B: batch size  
L: sequence length  
H: number of heads
C: channels (also called d_model, n_embed)
V: vocabulary size

input embedding (B, L, C)
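A minimal sketch of that convention in PyTorch (dimension values are made up for illustration):

```python
import torch
import torch.nn as nn

B, L, C, V = 4, 128, 512, 32000    # batch, seq len, channels, vocab (illustrative values)

embed = nn.Embedding(V, C)         # token id -> C-dim vector

tokens_BL = torch.randint(0, V, (B, L))   # token ids, shape (B, L)
x_BLC = embed(tokens_BL)                  # input embedding, shape (B, L, C)
assert x_BLC.shape == (B, L, C)
```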

Components

  • Feedforward
  • LayerNorm

When you compute attention, is it (L,C) * (C,L) or (C, L) * (L,C)?

  • We want to model the relationship between every token and every other token in the sequence, i.e. (L, L)

You get a matrix of shape

(B, H, L, d_k) @ (B, H, d_k, L) → (B, H, L, L)
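A shape-only sketch in PyTorch (names mine; the q/k tensors are faked from the input itself just to show the shapes, and I add the usual 1/√d_k scaling):

```python
import torch

B, L, C, H = 4, 128, 512, 8
d_k = C // H                        # per-head dimension

x_BLC = torch.randn(B, L, C)

# pretend these came from learned projections W_q, W_k
q_BHLD = x_BLC.view(B, L, H, d_k).transpose(1, 2)   # (B, H, L, d_k)
k_BHLD = x_BLC.view(B, L, H, d_k).transpose(1, 2)   # (B, H, L, d_k)

# (B, H, L, d_k) @ (B, H, d_k, L) -> (B, H, L, L)
scores_BHLL = q_BHLD @ k_BHLD.transpose(-2, -1) / d_k**0.5
assert scores_BHLL.shape == (B, H, L, L)
```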

From 3B1B: there’s causal masking, so that later tokens can’t affect the predictions at earlier positions (each position only attends to itself and earlier tokens).

  • Really good visualization from 3b1b
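A small sketch of that causal masking (illustrative; one head, one example): future positions are set to -inf before the softmax, so row i only puts weight on positions ≤ i.

```python
import torch

L = 6
scores_LL = torch.randn(L, L)                              # raw attention scores

mask_LL = torch.tril(torch.ones(L, L, dtype=torch.bool))   # lower-triangular: j <= i allowed
scores_LL = scores_LL.masked_fill(~mask_LL, float("-inf"))

attn_LL = scores_LL.softmax(dim=-1)   # row i only attends to positions <= i
```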

Resources

Some really good videos:

Attention computes importance.

So the left is the attention block.

Think of the output layer as multi-class classification over a ~32K-token vocabulary.

N is the number of layers

Don’t be confused: it’s NOT the number of layers inside the feedforward network. It’s the number of Transformer blocks.

  • N is like 40 for LLaMA
  • can be 512

Transformer block (CS231n 2025 Lec 8)

A Transformer is a stack of identical Transformer blocks. Each block takes a set of vectors and returns a set of the same shape, via two sub-layers:

  1. Multi-head self-attention: the only place vectors interact
  2. MLP applied independently to each vector (also called FFN / feed-forward network)

Each sub-layer is wrapped in a residual connection and followed by LayerNorm:

x → SA → +x → LayerNorm → MLP (per-vector) → +x → LayerNorm → y

The MLP is usually two linear layers with a nonlinearity in between (ReLU/GELU). Most of the compute is 6 matmuls: 4 from self-attention + 2 from the MLP. Because SA is the only token-mixing step and everything else is pointwise, the block is highly parallelizable.

The 4× widening acts as a sparse memory: the model projects into a wider space, picks out nonlinear features with the activation, then projects back. Think of it as a “per-token lookup table” that runs right after the tokens have mixed information.
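A minimal PyTorch sketch of this post-norm block (my own code, following the x → SA → +x → LN → MLP → +x → LN wiring above; nn.MultiheadAttention stands in for hand-rolled attention):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """x -> SA -> +x -> LayerNorm -> MLP -> +x -> LayerNorm -> y (post-norm, as in 2017)."""

    def __init__(self, C: int, H: int):
        super().__init__()
        # self-attention: 4 matmuls (q, k, v, output projections)
        self.attn = nn.MultiheadAttention(C, H, batch_first=True)
        self.ln1 = nn.LayerNorm(C)
        # MLP: 2 matmuls, widen to 4C then project back
        self.mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
        self.ln2 = nn.LayerNorm(C)

    def forward(self, x):               # x: (B, L, C)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + attn_out)      # residual add, then LayerNorm
        x = self.ln2(x + self.mlp(x))   # per-token MLP, residual add, LayerNorm
        return x                        # (B, L, C)
```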

Scale over time (same architecture, bigger numbers)

Model                  blocks   d_model   heads   seq    params
Vaswani et al. 2017    12       1024      16      512    213M
GPT-2 (2019)           48       1600      25      1024   1.5B
GPT-3 (2020)           96       12288     96      2048   175B

Transformer LM architecture

  1. Embedding matrix at the input: lookup table from token id to C-dim vector
  2. Stack of Transformer blocks with masked self-attention (each token sees only earlier tokens)
  3. Projection matrix at the output: project each C-dim vector to V-dim logits over the vocab
  4. Train with softmax + cross-entropy on next-token prediction
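Putting those four steps together in a sketch (the name TinyLM, the learned positional embedding, and the hyperparameters are mine; it reuses the TransformerBlock sketch above, so causal masking is omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, V: int, C: int, H: int, N: int, max_L: int):
        super().__init__()
        self.tok_embed = nn.Embedding(V, C)       # 1. token id -> C-dim vector
        self.pos_embed = nn.Embedding(max_L, C)   #    learned positions (one common choice)
        self.blocks = nn.ModuleList(TransformerBlock(C, H) for _ in range(N))  # 2. N blocks
        self.lm_head = nn.Linear(C, V)            # 3. C-dim vector -> V-dim logits

    def forward(self, ids_BL, targets_BL=None):
        B, L = ids_BL.shape
        pos_L = torch.arange(L, device=ids_BL.device)
        x_BLC = self.tok_embed(ids_BL) + self.pos_embed(pos_L)
        for block in self.blocks:
            x_BLC = block(x_BLC)                  # NOTE: masking omitted in this sketch
        logits_BLV = self.lm_head(x_BLC)
        loss = None
        if targets_BL is not None:
            # 4. softmax + cross-entropy on next-token prediction
            loss = F.cross_entropy(logits_BLV.reshape(B * L, -1), targets_BL.reshape(B * L))
        return logits_BLV, loss
```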

Modern tweaks to the 2017 block

The architecture has barely changed since 2017, but a few modifications have become standard:

  • Pre-Norm (Baevski & Auli 2018): move the LayerNorm inside the residual branch, before SA/MLP, instead of after the residual add. The original (post-norm) layout puts LN on the main path, so the block can’t even learn the identity function. Pre-Norm training is noticeably more stable: the residual stream flows unmodified from input to output, and each block only has to learn what to add (sketched in the code after this list)
  • RMSNorm (Zhang & Sennrich 2019) replaces LayerNorm: y = (x / RMS(x)) · g, where RMS(x) = sqrt(mean(x_i²) + eps). No mean-centering, no bias, slightly more stable
  • SwiGLU MLP (Shazeer 2020) replaces the classic FFN(x) = W2 · act(W1 x) with FFN(x) = W2 · (Swish(W1 x) ⊙ W3 x). Setting the inner dim to (2/3) · 4C keeps the param count the same (three weight matrices instead of two). Shazeer: “we attribute their success, as all else, to divine benevolence.”
  • Mixture of Experts (MoE) (Shazeer 2017): learn E separate expert MLPs per block; route each token to only a few (k) active experts. Total params grow roughly with E, but compute only with k. Almost every frontier LLM today (GPT-4o, Claude, Gemini) is believed to be MoE with >1T params. Intuition: most parameters are “dormant” for any given token. Each token only pays for the handful of experts relevant to it, so total capacity scales without paying FLOPs for every parameter
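Minimal sketches of the first three tweaks (my own PyTorch; the eps value, the (2/3)·4C inner dim, and the bias-free linears follow common practice rather than a specific source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """y = (x / RMS(x)) * g -- no mean-centering, no bias."""
    def __init__(self, C: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(C))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.g

class SwiGLU(nn.Module):
    """FFN(x) = W2 (Swish(W1 x) * W3 x); inner dim ~ (2/3)*4C keeps params comparable."""
    def __init__(self, C: int):
        super().__init__()
        inner = int(2 / 3 * 4 * C)
        self.w1 = nn.Linear(C, inner, bias=False)   # gate branch
        self.w3 = nn.Linear(C, inner, bias=False)   # value branch
        self.w2 = nn.Linear(inner, C, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class PreNormBlock(nn.Module):
    """Pre-norm: the norm sits inside the residual branch, so the residual stream is untouched."""
    def __init__(self, C: int, H: int):
        super().__init__()
        self.norm1 = RMSNorm(C)
        self.attn = nn.MultiheadAttention(C, H, batch_first=True)
        self.norm2 = RMSNorm(C)
        self.mlp = SwiGLU(C)

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                        # block only learns what to add
        x = x + self.mlp(self.norm2(x))
        return x
```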

Implementation Details

The part about how training data is fed to the model tripped me up when reading the Annotated Transformer (reading it is really, really helpful though).

Gold target:

<bos>   I   like   eating   mushrooms   <eos>

When we build inputs/labels:

  • inputs (trg) = <bos> I like eating mushrooms
  • labels (trg_y) = I like eating mushrooms <eos>

So the very first training example is:

  • Input to decoder at position 0: <bos>
  • Label at position 0: I

That means: the model is explicitly trained to predict the first word (“I”) given only <bos> and the encoder context.
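A tiny sketch of that shift (the token ids are made up; the trg / trg_y names follow the Annotated Transformer):

```python
import torch

# gold target: <bos> I like eating mushrooms <eos>  (made-up token ids)
gold = torch.tensor([1, 11, 12, 13, 14, 2])   # 1 = <bos>, 2 = <eos>

trg   = gold[:-1]   # decoder input:  <bos> I like eating mushrooms
trg_y = gold[1:]    # labels:         I like eating mushrooms <eos>

# position 0: given only <bos> (plus encoder context), the label is "I"
print(trg[0].item(), "->", trg_y[0].item())   # 1 -> 11

# the training loss is cross-entropy between the model's logits and trg_y, e.g.
# loss = F.cross_entropy(logits.view(-1, V), trg_y.view(-1))
```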