Transformer from Scratch
Writing some notes on how I’d like to do this from scratch.
Also some questions that arose as I built this.
Afterwards, these are the places you can compare against:
- https://pytorch.org/tutorials/intermediate/transformer_building_blocks.html
- HuggingFace transformers
- xformers
- torchtune
- Also see LLM Interview Prep, where related resources are linked
Input Preprocessing
Can we get away with just using ASCII as opposed to full Unicode?
- I tried doing it with raw UTF-8 bytes, but from a visualization perspective that gets ugly: UTF-8 uses a variable number of bytes per "character", so the model can produce byte sequences that don't decode to valid text.
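A quick illustration of that failure mode (plain Python; the example string is arbitrary):

    text = "héllo"
    ascii_ids = [ord(c) for c in "hello"]    # one id per character, all < 128
    utf8_ids = list(text.encode("utf-8"))    # [104, 195, 169, 108, 108, 111]: 'é' became two bytes
    # a model sampling raw bytes can emit a lead byte without its continuation byte,
    # which does not decode to valid text
    bytes([195, 104]).decode("utf-8", errors="replace")   # '\ufffdh'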
Tokenization process:
- In practice, tokenization is done in batches: https://huggingface.co/docs/datasets/use_dataset
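A minimal sketch of that batched pattern (the data file and tokenizer choice here are placeholders I picked, not from the original notes):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = load_dataset("text", data_files={"train": "input.txt"})

    def tokenize(batch):
        # batch["text"] is a list of strings, so the tokenizer runs on many examples per call
        return tokenizer(batch["text"], truncation=True)

    ds = ds.map(tokenize, batched=True)

With batched=True, map hands the function a chunk of rows at a time (1000 by default) instead of calling it once per example.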
Variable Naming
The Shape Suffixes convention was super helpful here.
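For context, the convention is to suffix each tensor name with one capital letter per dimension, so the shape is readable right at the call site. A tiny example using the letters from this post (the concrete sizes are arbitrary):

    import torch

    L, D, H = 10, 512, 8       # sequence length, model dim, number of heads
    K = D // H                 # per-head dim
    input_LD = torch.randn(L, D)                           # suffix LD  -> shape (L, D)
    heads_HLK = input_LD.view(L, H, K).permute(1, 0, 2)    # suffix HLK -> shape (H, L, K)
    assert heads_HLK.shape == (H, L, K)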
Multihead Attention
How is this implemented?
I was first thinking of doing a loop over heads, something like:
    outputs = []
    for head in heads:
        Q, K, V = slice(head)
        outputs.append(attention(Q, K, V))
    concat(outputs)
In practice, though, looping over heads is quite slow. Instead I used a view and a permute so the projections are computed for all heads at once:
    def multihead_attention(self, input_LD):
        L, D = input_LD.shape
        # project, split D into H heads of size K, and move the head dim to the front: (L, D) -> (H, L, K)
        Q_HLK = self.WQ(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
        K_HLK = self.WK(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
        V_HLK = self.WV(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
- The view and permute functions are quite helpful here
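Putting it together, a minimal end-to-end sketch of the batched-head version (my own fill-in: a single unbatched sequence of shape (L, D), no masking or dropout, plus an output projection WO that the snippet above doesn't show):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiheadAttention(nn.Module):
        def __init__(self, D, H):
            super().__init__()
            assert D % H == 0
            self.H, self.K = H, D // H
            self.WQ = nn.Linear(D, D)
            self.WK = nn.Linear(D, D)
            self.WV = nn.Linear(D, D)
            self.WO = nn.Linear(D, D)

        def forward(self, input_LD):
            L, D = input_LD.shape
            # (L, D) -> (L, H, K) -> (H, L, K): each head gets its own K-dim slice
            Q_HLK = self.WQ(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
            K_HLK = self.WK(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
            V_HLK = self.WV(input_LD).view(L, self.H, self.K).permute(1, 0, 2)
            # scaled dot-product attention for all heads at once: scores are (H, L, L)
            scores_HLL = Q_HLK @ K_HLK.transpose(-2, -1) / self.K ** 0.5
            attn_HLL = F.softmax(scores_HLL, dim=-1)
            # weighted sum of values, then merge heads back: (H, L, K) -> (L, H*K) = (L, D)
            out_HLK = attn_HLL @ V_HLK
            out_LD = out_HLK.permute(1, 0, 2).reshape(L, D)
            return self.WO(out_LD)

For example, MultiheadAttention(D=512, H=8)(torch.randn(10, 512)) gives back a (10, 512) tensor.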