Transformer

The main motivation for the transformer architecture was to improve the ability of neural networks to handle sequential data.

Transformers can process data in parallel.

A rough draft of the dimensions (inspo from Shape Suffixes)

B: batch size  
L: sequence length  
H: number of heads
C: channels (also called d_model, n_embed)
V: vocabulary size

input embedding (B, L, C)

When you compute attention, is it (L,C) * (C,L) or (C, L) * (L,C)?

  • We want to model the relationship between every token and every other token in the sequence, i.e. an (L, L) matrix, so it's (L, C) * (C, L)

Q = X · W_Q, where W_Q has shape (C, C) (and similarly for K and V)

You get the attention scores with these shapes:

(B, H, L, d_k) @ (B, H, d_k, L) → (B, H, L, L)
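
A minimal PyTorch sketch of those shapes (the sizes are made up, and the per-head split assumes d_k = C / H):

```python
import torch

B, L, C, H = 2, 16, 512, 8
d_k = C // H                      # per-head dimension, assuming d_k = C / H

X = torch.randn(B, L, C)          # input embeddings (B, L, C)
W_Q = torch.randn(C, C)           # W_Q is (C, C)
W_K = torch.randn(C, C)

Q = X @ W_Q                       # (B, L, C)
K = X @ W_K                       # (B, L, C)

# split the channel dim into H heads of size d_k: (B, L, C) -> (B, H, L, d_k)
Q = Q.view(B, L, H, d_k).transpose(1, 2)
K = K.view(B, L, H, d_k).transpose(1, 2)

# (B, H, L, d_k) @ (B, H, d_k, L) -> (B, H, L, L)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
print(scores.shape)               # torch.Size([2, 8, 16, 16])
```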

From 3B1B: there's causal masking, so that your current word doesn't affect earlier words; each position only attends to itself and the positions before it (sketch after the bullet below).

  • Really good visualization from 3b1b
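
A quick sketch of that causal mask, assuming the (B, H, L, L) scores tensor from above:

```python
import torch

B, H, L = 2, 8, 16
scores = torch.randn(B, H, L, L)                  # stand-in attention scores

# True above the diagonal = positions in the future of the query token
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # future positions get -inf
attn = torch.softmax(scores, dim=-1)              # -inf becomes weight 0
```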

Resources

Some really good videos:

Attention computes importance weights: how much each token should attend to every other token.

Attention

Feed Forward Layer

This is very straightforward: just two linear layers with a nonlinearity in between, applied independently at each position.
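
A minimal sketch of that feed-forward block; the 4×C hidden size follows the original paper (512 → 2048), but treat the exact numbers as illustrative:

```python
import torch.nn as nn

C = 512

ffn = nn.Sequential(
    nn.Linear(C, 4 * C),   # expand
    nn.ReLU(),
    nn.Linear(4 * C, C),   # project back to the model dimension
)
```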

So the left side of the diagram is the attention block.

Think of the output layer as multi-class classification over the vocabulary (e.g. 32K tokens).

They use a log loss (cross-entropy over the vocabulary); see the sketch below.
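
A sketch of that output head, projecting each position to vocabulary logits and training with cross-entropy; V = 32,000 is just the example size mentioned above:

```python
import torch
import torch.nn.functional as F

B, L, C, V = 2, 16, 512, 32_000

hidden = torch.randn(B, L, C)           # output of the last block
lm_head = torch.nn.Linear(C, V)         # project channels -> vocab logits
logits = lm_head(hidden)                # (B, L, V)

targets = torch.randint(0, V, (B, L))   # next-token labels
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))
```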

N is the number of stacked transformer blocks

Don't be confused: it's NOT the number of layers in the feed-forward network, it's how many times the whole block is repeated (sketch after the bullets).

  • is like 40 for LLaMA
  • can be 512
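
A sketch of what N counts, using nn.TransformerEncoderLayer purely as a stand-in for one attention + feed-forward block:

```python
import torch.nn as nn

N = 6          # 6 in the original paper; ~40 for LLaMA-13B
C, H = 512, 8

blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=C, nhead=H) for _ in range(N)]
)
```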

Old

Because self-attention alone loses the token order, you add a positional encoding vector to each embedding that tells the model about position (sketch below). Transformers also work super well with Self-Supervised Learning methods, because they can be trained to predict masked words on a large corpus.
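
A minimal sketch of that "order vector", i.e. the sinusoidal positional encoding from the original paper, which gets added to the token embeddings:

```python
import torch

L, C = 16, 512
pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)   # (L, 1) positions
i = torch.arange(0, C, 2, dtype=torch.float32)            # even channel indices
angle = pos / (10_000 ** (i / C))                         # (L, C/2)

pe = torch.zeros(L, C)
pe[:, 0::2] = torch.sin(angle)   # even channels
pe[:, 1::2] = torch.cos(angle)   # odd channels

# x = token_embeddings + pe      # broadcasts over the batch dimension
```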

You can take a pre-trained model and then fine-tune it.

This is SUPER GOOD: The annotated transformer. http://nlp.seas.harvard.edu/annotated-transformer/

“I wouldn’t recommend diving into papers as a newbie. You aren’t going to be familiar with the jargon and won’t be able to make much sense of things. My advice would be to start with this huggingface course”: https://huggingface.co/course/chapter1/1

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of Natural Language Processing and Computer Vision.

They support parallel computation, unlike RNNs and LSTMs, which process tokens sequentially and therefore don't map well onto GPUs.

The question is: can we process sequential data in parallel?

This video is really good.

You can kind of think of a Transformer as attention (from an RNN encoder-decoder) + the parallelism of a CNN; multi-head self-attention is sketched after the bullets below.

  • Self-Attention
  • Multi-Head Attention
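
A sketch of multi-head self-attention using PyTorch's built-in module (sizes match the shape suffixes above):

```python
import torch
import torch.nn as nn

B, L, C, H = 2, 16, 512, 8
x = torch.randn(B, L, C)

mha = nn.MultiheadAttention(embed_dim=C, num_heads=H, batch_first=True)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, weights.shape)  # (B, L, C) and (B, L, L), averaged over heads
```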