Transformer

The main motivation for the transformer architecture was to improve the ability of neural networks to handle sequential data.

Unlike RNNs, which consume a sequence one token at a time, transformers can process all tokens in parallel.

Architecturally, there's nothing about the transformer that prevents it from scaling to longer sequences. It's simply that when we compute attention, we end up with an (L, L) matrix of scores, and that grows quadratically with sequence length.
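A back-of-the-envelope sketch of that quadratic blow-up (the head count and sequence lengths below are just example numbers):

```python
# How big the per-layer (L, L) attention score matrix gets as L grows.
for L in [512, 4096, 32768]:
    H = 32                    # assumed number of heads
    floats = H * L * L        # one (L, L) score matrix per head
    gib = floats * 4 / 2**30  # fp32 bytes -> GiB, for a single sequence
    print(f"L={L:>6}: {gib:7.2f} GiB of attention scores per layer")
```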

A rough draft of the dimension names (inspo from Shape Suffixes):

B: batch size  
L: sequence length  
H: number of heads
C: channels (also called d_model, n_embed)
V: vocabulary size

input embedding (B, L, C)
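A minimal sketch of that first step in PyTorch (B, L, C, V are made-up example sizes):

```python
import torch
import torch.nn as nn

B, L, C, V = 2, 8, 512, 32_000
embed = nn.Embedding(V, C)               # lookup table: V rows, C columns
token_ids = torch.randint(0, V, (B, L))  # a batch of token-id sequences
x = embed(token_ids)
print(x.shape)                           # torch.Size([2, 8, 512]) = (B, L, C)
```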

Components

  • Multi-head attention
  • Feedforward
  • LayerNorm
  • Residual connections around each of the above
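How these wire together in one block, as a minimal pre-LN sketch (details like norm placement vary between implementations):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, C: int, H: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(C)
        self.attn = nn.MultiheadAttention(C, H, batch_first=True)
        self.ln2 = nn.LayerNorm(C)
        self.ff = nn.Sequential(  # feedforward: expand to 4C, project back to C
            nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C)
        )

    def forward(self, x):  # x: (B, L, C)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ff(self.ln2(x))                       # residual around feedforward
        return x
```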

When you compute attention, is it (L, C) @ (C, L) or (C, L) @ (L, C)?

  • We want to model the relationship between every token and every other token in the sequence, i.e. an (L, L) matrix, so it's (L, C) @ (C, L)

Per head (with d_k = C / H), Q @ Kᵀ gives a matrix of shape:

(B, H, L, d_k) @ (B, H, d_k, L) → (B, H, L, L)
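The same bookkeeping in code (all sizes are example values):

```python
import torch

B, H, L, d_k = 2, 8, 16, 64
q = torch.randn(B, H, L, d_k)
k = torch.randn(B, H, L, d_k)
scores = q @ k.transpose(-2, -1) / d_k**0.5  # (B, H, L, d_k) @ (B, H, d_k, L)
print(scores.shape)                          # torch.Size([2, 8, 16, 16]) = (B, H, L, L)
```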

From 3B1B: there's causal masking, so that a later word can't affect the words before it; each position attends only to itself and earlier positions.

  • Really good visualization from 3b1b
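A minimal sketch of that mask: future positions are set to -inf before the softmax, so they get zero attention weight:

```python
import torch

L = 5
scores = torch.randn(L, L)  # one head, one sequence: (L, L) raw scores
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # True above diagonal
weights = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
print(weights[0])  # row 0 attends only to position 0: tensor([1., 0., 0., 0., 0.])
```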

Resources

Some really good videos (the notes below are from 3Blue1Brown's transformer series):

Attention computes importance: how much each token should attend to every other token.

(From the 3b1b visualization: the left side of the diagram is the attention block.)

Think of the final output layer as multi-class classification over the vocabulary, e.g. 32K tokens.
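A sketch of that view: the final linear layer produces (B, L, V) logits, and the loss is ordinary cross-entropy at every position (32_000 matches Llama's vocab size; the rest are example numbers):

```python
import torch
import torch.nn.functional as F

B, L, C, V = 2, 8, 512, 32_000
hidden = torch.randn(B, L, C)  # decoder output
lm_head = torch.nn.Linear(C, V)
logits = lm_head(hidden)       # (B, L, V): a V-way classifier per position
labels = torch.randint(0, V, (B, L))
loss = F.cross_entropy(logits.reshape(-1, V), labels.reshape(-1))
print(logits.shape, loss.item())
```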

N is the number of layers

Don't be confused: it's NOT the number of layers inside the feedforward network. It's the number of stacked transformer blocks.

  • is around 40 for Llama
  • C (d_model), by contrast, can be 512, as in the original Transformer
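Stacking is literally just repeating the block N times. A sketch using PyTorch's built-in encoder layer (sizes are example values):

```python
import torch
import torch.nn as nn

N, C, H = 6, 512, 8  # N = 6 in the original Transformer paper
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=C, nhead=H, batch_first=True)
    for _ in range(N)
)

x = torch.randn(2, 16, C)  # (B, L, C)
for block in blocks:       # each block = one attention + one feedforward
    x = block(x)
print(x.shape)             # shape is preserved: (B, L, C)
```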

Implementation Details

The part about how training data is fed to the model tripped me up when reading the Annotated Transformer (reading it is really, really helpful though).

Gold target:

<bos>   I   like   eating   mushrooms   <eos>

When we build inputs/labels:

  • inputs (trg) = <bos> I like eating mushrooms
  • labels (trg_y) = I like eating mushrooms <eos>

So the very first training example is:

  • Input to decoder at position 0: <bos>
  • Label at position 0: I

That means: the model is explicitly trained to predict the first word ("I") given only <bos> and the encoder context.
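A minimal sketch of that shift, using the trg / trg_y names from the Annotated Transformer (token strings stand in for token ids):

```python
gold = ["<bos>", "I", "like", "eating", "mushrooms", "<eos>"]
trg   = gold[:-1]  # decoder input:  <bos> I like eating mushrooms
trg_y = gold[1:]   # labels:         I like eating mushrooms <eos>

for pos, label in enumerate(trg_y):
    print(f"position {pos}: sees {trg[:pos + 1]} -> must predict {label!r}")
# position 0: sees ['<bos>'] -> must predict 'I'
```

The causal mask is what makes "sees only trg[:pos + 1]" true even though the whole trg sequence is fed into the decoder at once.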