Transformer

The main motivation for the transformer architecture was to improve the ability of neural networks to handle sequential data.

Unlike RNNs, which read one token at a time, transformers can process all positions of a sequence in parallel.

A rough draft of the dimensions (inspo from Shape Suffixes)

B: batch size  
L: sequence length  
H: number of heads
C: channels (also called d_model, n_embed)
V: vocabulary size

input embedding (B, L, C)
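
A minimal sketch of where the (B, L, C) shape comes from, assuming the usual setup of integer token ids looked up in a (V, C) table. The sizes and the names `token_ids` and `embedding_table` are made up for illustration, and V is taken to mean vocabulary size.

```python
import numpy as np

# Illustrative sizes only (assumptions, not from any particular model).
B, L, C, V = 2, 5, 8, 100

rng = np.random.default_rng(0)
token_ids = rng.integers(0, V, size=(B, L))      # (B, L) integer ids
embedding_table = rng.standard_normal((V, C))    # (V, C), one row per vocabulary entry

# Fancy indexing picks one row of the table per token id.
X = embedding_table[token_ids]                   # (B, L, C)
print(X.shape)
```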

When you compute attention, is it (L,C) * (C,L) or (C, L) * (L,C)?

  • We want to model the relationship between every token and every other token in the sequence, i.e. an (L, L) matrix, so it has to be (L, C) @ (C, L).

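Q = X @ W_Q, where W_Q has shape (C, C). K = X @ W_K works the same way.

After splitting into H heads of size d_k = C / H, the attention scores Q @ K^T have shape

(B, H, L, d_k) @ (B, H, d_k, L) → (B, H, L, L)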
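
To check the shapes by hand, here is a rough NumPy sketch. It assumes C is divisible by the number of heads H and that d_k = C / H. The sizes, the `split_heads` helper, and the scaling by sqrt(d_k) are illustrative choices for this sketch.

```python
import numpy as np

# Illustrative sizes only (assumptions, not from any particular model).
B, L, H, C = 2, 5, 4, 32
d_k = C // H  # per-head width; assumes C is divisible by H

rng = np.random.default_rng(0)
X = rng.standard_normal((B, L, C))     # input embeddings (B, L, C)
W_Q = rng.standard_normal((C, C))      # query projection (C, C)
W_K = rng.standard_normal((C, C))      # key projection (C, C)

def split_heads(t):
    # (B, L, C) -> (B, L, H, d_k) -> (B, H, L, d_k)
    return t.reshape(B, L, H, d_k).transpose(0, 2, 1, 3)

Q = split_heads(X @ W_Q)               # (B, H, L, d_k)
K = split_heads(X @ W_K)               # (B, H, L, d_k)

# Swap the last two axes of K so the d_k axes contract: (L, d_k) @ (d_k, L).
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # (B, H, L, L)
print(scores.shape)
```

The leading (B, H) axes are just carried along by the batched matmul; only the last two axes take part in the product.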

From 3B1B: There’s masking, so that later words can’t affect earlier ones (each position only attends to itself and the positions before it).

  • Really good visualization from 3b1b
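
A small sketch of the masking idea for a single head, assuming the common approach of setting future positions to -inf before the softmax. The sequence length and the random scores are placeholders.

```python
import numpy as np

L = 4  # illustrative sequence length
rng = np.random.default_rng(0)
scores = rng.standard_normal((L, L))   # raw attention scores for one head

# Lower-triangular mask: row i (query) may look at columns 0..i (keys) only.
mask = np.tril(np.ones((L, L), dtype=bool))
masked = np.where(mask, scores, -np.inf)

# Softmax over the key axis; exp(-inf) = 0, so future positions get weight 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```

The first row ends up with all of its weight on position 0, since that is the only position it is allowed to see.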

Resources

Some really good videos:

Attention computes importance: how much each token should weigh every other token when updating its own representation.

So the left is the attention block.

Think of the output layer as multi-class classification over the vocabulary, e.g. 32K tokens.
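
Assuming the remark is about the final output layer, the classification view can be sketched as a cross-entropy over V classes at every position. The random logits and targets below are placeholders, not model outputs.

```python
import numpy as np

B, L, V = 2, 3, 32_000   # V = 32K classes, matching the example in the note
rng = np.random.default_rng(0)
logits = rng.standard_normal((B, L, V))      # one score per vocabulary entry, per position
targets = rng.integers(0, V, size=(B, L))    # index of the correct next token

# Numerically stable log-softmax over the vocabulary axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Cross-entropy: negative log-probability assigned to the correct class.
picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)  # (B, L)
loss = -picked.mean()
print(loss)
```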

N is the number of layers

Don’t be confused: it’s NOT the number of layers in the feedforward network. It’s the number of stacked blocks.

  • N is like 40 for Llama
  • can be 512
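
A toy sketch of the difference between the N blocks and the layers inside one feedforward network. The attention sublayer and layer norm are left out on purpose, and the sizes, `make_block`, and `apply_block` are hypothetical names for this illustration only.

```python
import numpy as np

B, L, C, N = 2, 5, 16, 3   # illustrative sizes; N = number of stacked blocks
d_ff = 4 * C               # hidden width of the feedforward net (a common choice, assumed)
rng = np.random.default_rng(0)

def make_block():
    # Each block owns its own weights. The feedforward net inside has
    # 2 layers, regardless of N.
    return {
        "W1": rng.standard_normal((C, d_ff)) * 0.02,
        "W2": rng.standard_normal((d_ff, C)) * 0.02,
    }

def apply_block(x, p):
    # Attention sublayer omitted for brevity; only the feedforward sublayer
    # with a residual connection is shown.
    hidden = np.maximum(x @ p["W1"], 0.0)   # ReLU
    return x + hidden @ p["W2"]

blocks = [make_block() for _ in range(N)]   # N blocks, stacked
x = rng.standard_normal((B, L, C))
for p in blocks:
    x = apply_block(x, p)                   # shape stays (B, L, C) throughout
print(len(blocks), x.shape)
```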

Implementation Details

The part about how training data is fed to the decoder tripped me up while reading The Annotated Transformer (reading it is really helpful, though).

Gold target:

<bos>   I   like   eating   mushrooms   <eos>

When we build inputs/labels:

  • decoder input (trg) = <bos> I like eating mushrooms
  • labels (trg_y) = I like eating mushrooms <eos>

So the very first training example is:

  • Input to decoder at position 0: <bos>
  • Label at position 0: I

That means: the model is explicitly trained to predict the first word (“I”) given only <bos> and the encoder context.
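
The shift can be sketched in plain Python. This assumes the decoder input is the gold target without its last token and the labels are the gold target without its first token; `trg` is a name chosen here to pair with trg_y. Token strings stand in for integer ids.

```python
# Token strings are used for readability; a real pipeline would use integer ids.
gold = ["<bos>", "I", "like", "eating", "mushrooms", "<eos>"]

trg = gold[:-1]    # decoder input: everything except the last token
trg_y = gold[1:]   # labels: everything except the first token

for pos, (inp, label) in enumerate(zip(trg, trg_y)):
    # With causal masking, position `pos` sees trg[0..pos] and must predict trg_y[pos].
    print(pos, trg[: pos + 1], "->", label)
```

Every position is trained at once this way: position 0 sees only <bos> and predicts "I", while the last position sees the whole input and predicts <eos>.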