# Transformer
The main motivation for the transformer architecture was to improve the ability of neural networks to handle sequential data. Unlike RNNs, transformers can process the whole sequence in parallel.
## Resources
- Attention Is All You Need by Vaswani et al.
- Just read this blog
- I really liked these slides from Waterloo's CS480

Some really good videos:
- History of Transformers
- The Transformer Architecture
- Vision Transformer (An Image is Worth 16x16 Words) video
- Swin Transformer
- Vision Transformers
Attention computes importance: how much weight each query places on each key's value.
- Inputs:
  - Query matrix $Q \in \mathbb{R}^{m \times d}$ (rows $\mathbf{q}_1, \dots, \mathbf{q}_m$)
  - Key matrix $K \in \mathbb{R}^{n \times d}$ (rows $\mathbf{k}_1, \dots, \mathbf{k}_n$)
  - Value matrix $V \in \mathbb{R}^{n \times d}$ (rows $\mathbf{v}_1, \dots, \mathbf{v}_n$)
- Output: an $m \times d$ matrix
### Attention mechanism

Compare each query against each key with a scaled dot product. Then (the softmax operation is row-wise, i.e., each row of $\frac{QK^T}{\sqrt{d}}$ is normalized via $\text{softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_l e^{z_l}}$):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$

In matrix form:

$$\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = \begin{bmatrix} \text{softmax}\left(\frac{\langle \mathbf{q}_1, \mathbf{k}_1 \rangle}{\sqrt{d}}\right)\mathbf{v}_1^T + \cdots + \text{softmax}\left(\frac{\langle \mathbf{q}_1, \mathbf{k}_n \rangle}{\sqrt{d}}\right)\mathbf{v}_n^T \\ \vdots \\ \text{softmax}\left(\frac{\langle \mathbf{q}_m, \mathbf{k}_1 \rangle}{\sqrt{d}}\right)\mathbf{v}_1^T + \cdots + \text{softmax}\left(\frac{\langle \mathbf{q}_m, \mathbf{k}_n \rangle}{\sqrt{d}}\right)\mathbf{v}_n^T \end{bmatrix} \in \mathbb{R}^{m \times d}$$
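A minimal NumPy sketch of this computation (the function and variable names are mine; $m$ queries attending over $n$ key/value pairs of dimension $d$):

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (m, d), K: (n, d), V: (n, d) -> returns (m, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (m, n) scaled dot products
    weights = softmax(scores, axis=-1)  # row-wise softmax: each row sums to 1
    return weights @ V                  # each output row is a weighted sum of value rows

# Example: 3 queries attending over 5 key/value pairs, d = 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (3, 4)
```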
What is $d_k$? It's the dimension of the key vectors (here just $d$).
- It's just a scaling factor
- I asked the professor and he said empirically, it gives the best performance

#### Attention Layer

This is cross-attention
![[attachments/Screenshot 2024-03-05 at 5.06.43 PM.png]]

### Feed Forward Layer

This is very straightforward, just 2 layers.
![[attachments/Screenshot 2024-03-05 at 5.11.27 PM.png]]
![[attachments/Screenshot 2024-03-05 at 5.14.37 PM.png]]

So the left is the attention block.
![[attachments/Screenshot 2024-03-05 at 5.19.51 PM.png]]

Think of it as a multi-class classification over 32K tokens. They use a log loss.

> [!warning] $N$ is the number of layers
>
> Don't be confused: it's not the number of layers of the feed-forward network. It's the number of transformer blocks.

- $N$ is like 40 for LLaMA
- $d$ can be 512

### Old

Because you lose the order, you add a positional encoding vector that tells you about the order (see the sketch at the end of these notes).

And they work super well with [[notes/Self-Supervised Learning|Self-Supervised Learning]] methods, because they can predict masked words on a large corpus. You can have a pre-trained model and then do fine-tuning.

This is SUPER GOOD: The Annotated Transformer. http://nlp.seas.harvard.edu/annotated-transformer/

"I wouldn't recommend diving into papers as a newbie. You aren't going to be familiar with the jargon and won't be able to make much sense of things. My advice would be to start with this huggingface course": [https://huggingface.co/course/chapter1/1](https://huggingface.co/course/chapter1/1)

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of [[notes/Natural Language Processing|Natural Language Processing]] and [[notes/Computer Vision|Computer Vision]].

They support parallel operations, which is not the case for RNNs and LSTMs, since those process the sequence step by step, which doesn't work very well on GPUs. The question is: can we parallelize sequential data?

This [video](https://www.youtube.com/watch?v=TQQlZhbC5ps&ab_channel=CodeEmporium) is really good. You can kind of think of a transformer as Attention (from an RNN) + CNN.

- Self-Attention
- Multi-Head Attention
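On those last two bullets: a minimal sketch of multi-head self-attention, reusing the `attention` function from the sketch above. In self-attention, $Q$, $K$, and $V$ are all projections of the same input $X$. The weight matrices and head count here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Self-attention: queries, keys, and values all come from the same X (n, d)
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (n, d)
    # Split the model dimension into num_heads smaller heads
    def split(M):
        return M.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)     # each (num_heads, n, d/num_heads)
    # Run scaled dot-product attention independently per head
    heads = [attention(Qh[i], Kh[i], Vh[i]) for i in range(num_heads)]
    # Concatenate heads and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o   # (n, d)

# Example with d = 8 and 2 heads (weights random, for illustration only)
rng = np.random.default_rng(1)
d, h = 8, 2
X = rng.normal(size=(6, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 8)
```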
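And on the "you lose the order" point in the Old section: a sketch of the sinusoidal positional encoding from Vaswani et al., which gets added to the token embeddings (variable names are mine):

```python
import numpy as np

def positional_encoding(seq_len, d):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    two_i = np.arange(0, d, 2)[None, :]      # (1, d/2): the even dimension indices
    angles = pos / np.power(10000.0, two_i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

# Added to embeddings so the model can tell positions apart, e.g., with d = 512
emb = np.zeros((10, 512))  # stand-in for 10 token embeddings
print((emb + positional_encoding(10, 512)).shape)  # (10, 512)
```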