Transformer from Scratch

Writing some notes on how I’d like to do this from scratch.

Also some questions that arose as I built this.

Afterwards, these are the places you can compare against:

Input Preprocessing

Can we get away with just using ASCII as opposed to Unicode?

  • Was trying to do it with just UTF-8, but from a visualization perspective that would also be super ugly, because UTF-8 uses a variable number of bytes per “character”, so the model might produce something invalid.

Tokenization process:
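A rough sketch of the simplest version, assuming a plain 128-entry ASCII vocabulary where each token id is just the character’s byte value (the encode/decode names are only for illustration):

def encode(text: str) -> list[int]:
    # each ASCII character is its own token: the id is just the byte value (0-127)
    return list(text.encode("ascii"))

def decode(token_ids: list[int]) -> str:
    # every id below 128 is a valid ASCII byte, so decoding always round-trips
    return bytes(token_ids).decode("ascii")

assert decode(encode("hello")) == "hello"

With ASCII the vocabulary is only 128 tokens, so the embedding table stays tiny and the model can never emit a byte sequence that fails to decode.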

Variable Naming

The Shape Suffixes convention was super helpful here.
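For example, with L = sequence length, D = model dimension, H = number of heads, and K = the per-head dimension, the shape of a tensor is readable straight off its name:

input_LD   # shape (L, D): one D-dimensional embedding per position
Q_HLK      # shape (H, L, K): queries, one (L, K) slice per head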

Multi-head Attention

How is this implemented?

Was thinking of doing something like:

outputs = []
for head in heads:
    Q, K, V = slice(head)               # take this head's slice of the projections
    outputs.append(attention(Q, K, V))  # run attention within each head separately

concat(outputs)                         # concatenate the per-head outputs at the end

In practice though, this is quite slow. Instead I did a view and a permute, so the same operations run over every head at once:

def multihead_attention(self, input_LD):
    L, D = input_LD.shape                        # L: sequence length, D: model dim
    H, K = self.num_heads, D // self.num_heads   # H: number of heads, K: per-head dim
    # project, split D into H heads of size K, then move the head dim to the front
    Q_HLK = self.WQ(input_LD).view(L, H, K).permute(1, 0, 2)
    K_HLK = self.WK(input_LD).view(L, H, K).permute(1, 0, 2)
    V_HLK = self.WV(input_LD).view(L, H, K).permute(1, 0, 2)
  • The view and permute functions are quite helpful here.
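For reference, the rest of the batched computation could continue inside multihead_attention roughly like this (a sketch rather than my exact code; the self.WO output projection at the end is an assumed final linear layer):

    # scores_HLL[h, i, j]: how strongly position i attends to position j in head h
    scores_HLL = Q_HLK @ K_HLK.transpose(-2, -1) / (K ** 0.5)
    weights_HLL = scores_HLL.softmax(dim=-1)   # softmax over the key positions
    out_HLK = weights_HLL @ V_HLK              # weighted sum of the values per head

    # undo the earlier view/permute: (H, L, K) -> (L, H, K) -> (L, D)
    out_LD = out_HLK.permute(1, 0, 2).reshape(L, D)
    return self.WO(out_LD)                     # assumed output projection back to D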