Transformer
The main motivation for the transformer architecture was to improve the ability of neural networks to handle sequential data.
Unlike RNNs, transformers can process all the tokens in a sequence in parallel.
A rough draft of the dimension names (inspo from Shape Suffixes):
B: batch size
L: sequence length
H: number of heads
C: channels (also called d_model, n_embed)
V: vocabulary size
d_k: dimension per head (C / H)
input embedding (B, L, C)
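A quick sketch of how you get that (B, L, C) input embedding. PyTorch is my assumption here (the notes don't pick a framework), and the concrete sizes are made up:

```python
import torch
import torch.nn as nn

B, L, C, V = 2, 16, 512, 32000        # batch, sequence length, channels (d_model), vocab size

tok_emb = nn.Embedding(V, C)          # maps each token id to a C-dimensional vector
tokens = torch.randint(0, V, (B, L))  # (B, L) integer token ids
x = tok_emb(tokens)                   # (B, L, C) input embedding
print(x.shape)                        # torch.Size([2, 16, 512])
```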
When you compute attention, is it (L,C) * (C,L) or (C, L) * (L,C)?
- We want to model the relationship between every token and every other token in the sequence, i.e. an (L, L) matrix.
Q = X * W_Q, where W_Q has shape (C, C) (and similarly K = X * W_K, V = X * W_V)
After splitting into H heads of size d_k = C / H, you get a score matrix of shape
(B, H, L, d_k) @ (B, H, d_k, L) → (B, H, L, L)
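A shape check for that step, as a hedged PyTorch sketch (framework and sizes are my assumptions):

```python
import torch
import torch.nn as nn

B, L, C, H = 2, 16, 512, 8
d_k = C // H                                   # dimension per head

x = torch.randn(B, L, C)                       # input embedding (B, L, C)
W_Q = nn.Linear(C, C, bias=False)              # weight of shape (C, C)
W_K = nn.Linear(C, C, bias=False)

# Project, then split C into H heads of size d_k: (B, L, C) -> (B, H, L, d_k)
Q = W_Q(x).view(B, L, H, d_k).transpose(1, 2)
K = W_K(x).view(B, L, H, d_k).transpose(1, 2)

scores = Q @ K.transpose(-2, -1) / d_k**0.5    # (B, H, L, d_k) @ (B, H, d_k, L) -> (B, H, L, L)
print(scores.shape)                            # torch.Size([2, 8, 16, 16])
```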
From 3B1B: there's causal masking, so that the current word doesn't affect the words before it, i.e. each position can only attend to earlier positions (see the sketch after the bullet).
- Really good visualization from 3b1b
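A minimal sketch of that causal mask, continuing from the score matrix above (still an assumption-level PyTorch sketch, not anyone's exact implementation):

```python
import torch

B, H, L, d_k = 2, 8, 16, 64
scores = torch.randn(B, H, L, L)                      # e.g. Q @ K^T / sqrt(d_k)
values = torch.randn(B, H, L, d_k)                    # value vectors

# Causal mask: position i may only attend to positions <= i.
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))     # future positions get -inf
weights = scores.softmax(dim=-1)                      # each row sums to 1 over the allowed positions
out = weights @ values                                # (B, H, L, L) @ (B, H, L, d_k) -> (B, H, L, d_k)
```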
Resources
- LLM Visualization
- Attention Is All You Need by Vaswani et al.
- Attention is all you need (Transformer) - Model explanation (including math), Inference and Training by Umar Jamil
- the slides
- Just read this blog
- I really liked these slides from Waterloo’s CS480
Some really good videos:
- History of Transformers
- The Transformer Architecture
- Vision Transformer (An Image is Worth 16x16 Words) video
- Swin Transformer
- Vision Transformers
Attention computes how important every other token in the sequence is to the current token.
Feed Forward Layer
This is very straightforward: just 2 linear layers with a non-linearity in between, applied at every position (see the sketch below).
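A sketch of those 2 layers. The 4x expansion (d_ff = 2048 for d_model = 512) and ReLU follow the original paper; newer models often swap in GELU or SwiGLU:

```python
import torch
import torch.nn as nn

C, d_ff = 512, 2048           # d_model and hidden size (4x expansion, as in the original paper)

ffn = nn.Sequential(
    nn.Linear(C, d_ff),       # expand
    nn.ReLU(),                # original paper's activation
    nn.Linear(d_ff, C),       # project back down
)

x = torch.randn(2, 16, C)     # (B, L, C)
y = ffn(x)                    # applied independently at every position; same shape out
```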
So the left is the attention block.
Think of the output layer as a multi-class classification over the ~32K vocabulary tokens.
They use a log loss (cross-entropy); see the sketch below.
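A sketch of that classification view (V = 32000 is an assumed Llama-style vocab size, PyTorch assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, C, V = 2, 16, 512, 32000

hidden = torch.randn(B, L, C)           # output of the final block
lm_head = nn.Linear(C, V, bias=False)   # project to vocabulary logits

logits = lm_head(hidden)                # (B, L, V): one V-way classification per position
targets = torch.randint(0, V, (B, L))   # next-token targets
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))   # the log loss over the vocab
```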
N is the number of stacked transformer blocks. Don't be confused: it's NOT the number of layers inside the feed-forward network; it's the number of whole blocks (see the sketch after the bullets).
- is around 40 for Llama
- is 6 in the original paper (the 512 from that paper is d_model, not N)
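A sketch of what N counts, just to make the point that it's whole blocks being stacked (the block internals here are simplified stand-ins, not any particular model's code):

```python
import torch
import torch.nn as nn

C, N = 512, 6   # d_model and number of blocks (6 in the original paper)

class Block(nn.Module):
    """One transformer block: attention sub-layer + feed-forward sub-layer (simplified, pre-LN)."""
    def __init__(self, C):
        super().__init__()
        self.attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(C, 4 * C), nn.ReLU(), nn.Linear(4 * C, C))
        self.ln1, self.ln2 = nn.LayerNorm(C), nn.LayerNorm(C)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]        # residual connection around attention
        return x + self.ffn(self.ln2(x))     # residual connection around the feed-forward layer

blocks = nn.Sequential(*[Block(C) for _ in range(N)])   # N = how many of these blocks get stacked
x = blocks(torch.randn(2, 16, C))                        # (B, L, C) in, (B, L, C) out
```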
Old
Because self-attention by itself loses the order of the tokens, you add a positional encoding vector to each embedding that tells the model about the order (sketch below). Transformers also work super well with self-supervised learning methods, because they can be pre-trained by predicting masked words on a large corpus.
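A sketch of the sinusoidal positional encoding from the original paper, one concrete choice for that "order" vector (learned positional embeddings are another):

```python
import math
import torch

def sinusoidal_positional_encoding(L, C):
    """(L, C) matrix of the sin/cos position vectors from the original paper."""
    pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    div = torch.exp(torch.arange(0, C, 2, dtype=torch.float32) * (-math.log(10000.0) / C))
    pe = torch.zeros(L, C)
    pe[:, 0::2] = torch.sin(pos * div)   # even channels
    pe[:, 1::2] = torch.cos(pos * div)   # odd channels
    return pe

x = torch.randn(2, 16, 512)                          # (B, L, C) token embeddings
x = x + sinusoidal_positional_encoding(16, 512)      # broadcasts over the batch dimension
```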
You can take a pre-trained model and then fine-tune it for your task.
This is SUPER GOOD: The annotated transformer. http://nlp.seas.harvard.edu/annotated-transformer/
“I wouldn’t recommend diving into papers as a newbie. You aren’t going to be familiar with the jargon and won’t be able to make much sense of things. My advice would be to start with this huggingface course”: https://huggingface.co/course/chapter1/1
A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of Natural Language Processing and Computer Vision.
They support parallel operations, which is not the case for RNNs and LSTMs, since those have to process tokens one step at a time, which doesn't work very well on GPUs.
The question is, can we parallelize sequential data?
This video is really good.
You can kind of think of a transformer as attention (borrowed from RNN encoder-decoders) + CNN-style parallel, stacked processing, built around:
- Self-Attention
- Multi-Head Attention (see the sketch below)
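A minimal sketch that ties self-attention and multi-head attention together (no masking, dropout, or cross-attention; PyTorch assumed, not any particular library's implementation):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Project to Q/K/V, attend separately in each head, then merge the heads."""
    def __init__(self, C, H):
        super().__init__()
        assert C % H == 0
        self.H, self.d_k = H, C // H
        self.W_Q, self.W_K, self.W_V = (nn.Linear(C, C, bias=False) for _ in range(3))
        self.W_O = nn.Linear(C, C, bias=False)   # output projection after concatenating heads

    def forward(self, x):                        # x: (B, L, C)
        B, L, C = x.shape
        split = lambda t: t.view(B, L, self.H, self.d_k).transpose(1, 2)    # -> (B, H, L, d_k)
        Q, K, V = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5                    # (B, H, L, L)
        out = scores.softmax(dim=-1) @ V                                    # (B, H, L, d_k)
        return self.W_O(out.transpose(1, 2).reshape(B, L, C))               # merge heads -> (B, L, C)

mha = MultiHeadSelfAttention(C=512, H=8)
y = mha(torch.randn(2, 16, 512))                 # (2, 16, 512)
```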