Vision Transformer (ViT)
At its core it’s the same as a standard Transformer. You split the image into 16×16-pixel patches, flatten each patch, and encode it into a vector with a learned linear projection. A positional embedding is added to each patch embedding, and the original paper also prepends one extra learnable [CLS] token for classification.
Then you make classification predictions from the head (variants below):
How does each variant differ from the original ViT paper?
Most of the backbone logic of patchifying and projecting the image is the same; the differences are in the heads and in how the losses work.
A CNN trains well on less data because locality is built into the architecture: the model doesn’t need to learn how to focus, only what to focus on. A Transformer has to learn how to focus (its attention patterns) from the data itself.
See An Image is Worth 16x16 Words for notes.
Architecture (CS231n 2025 Lec 8)
Take the standard Transformer block (designed for sets of vectors) and feed it image patches:
- Patchify: split a 224×224×3 image into 196 non-overlapping 16×16×3 patches.
- Flatten + linear project: each flattened 768-dim patch (16·16·3) goes through a learned linear projection to the embedding dimension D. Equivalently, this is a 16×16 conv with stride 16, 3 input channels, and D output channels.
- Add positional encoding to each patch embedding so the Transformer knows the 2D position. Self-attention is permutation-equivariant on its own, so without PE the model would be blind to image layout.
- Run through Transformer blocks with no masking: every patch attends to every other patch (unlike a language model, which masks future tokens).
- Pool + classify: average-pool the output tokens into a single vector, then a linear layer produces the class scores.
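The patchify + linear-projection step above, and its equivalence to a strided conv, can be checked in a few lines of numpy. This is a minimal sketch: the image is shrunk to 32×32 (a 2×2 grid of patches) and the embedding dim D=8 is arbitrary, just to keep the demo small.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32          # toy image size: a 2x2 grid of 16x16 patches
P, C, D = 16, 3, 8  # patch size, input channels, embedding dim (D=8 is arbitrary)

img = rng.standard_normal((H, W, C))
Wproj = rng.standard_normal((P * P * C, D))  # the "learned" linear projection

# Path 1: patchify -> flatten -> one big matmul
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * C) @ Wproj  # (num_patches, D)

# Path 2: a 16x16 "conv" with stride 16 — same matmul applied per window
conv_out = np.empty_like(tokens)
idx = 0
for i in range(0, H, P):
    for j in range(0, W, P):
        window = img[i:i + P, j:j + P, :].reshape(-1)
        conv_out[idx] = window @ Wproj
        idx += 1

assert np.allclose(tokens, conv_out)  # linear projection == strided conv
print(tokens.shape)  # (4, 8): 4 patch tokens, each embedded to D=8 dims
```

For the real 224×224 input this gives 196 tokens of dimension D, which is exactly the sequence the Transformer blocks consume.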
Why bother (vs a CNN)? After one attention layer, each output patch depends directly on every input patch; there’s no need to stack many conv layers to grow the receptive field. The trade-off is the attention cost, which is quadratic in the number of patches.
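To make that trade-off concrete, here is the token count and attention-matrix size as the image grows (patch size 16 assumed, as in the architecture sketch above):

```python
# Doubling the image side quadruples the token count and grows the
# attention matrix 16x, since every token attends to every other token.
def num_tokens(side, patch=16):
    return (side // patch) ** 2

for side in (224, 448):
    n = num_tokens(side)
    # n^2 attention entries per head per layer
    print(f"{side}x{side}: {n} tokens, {n * n} attention entries")
```

So 224×224 gives 196 tokens and a 196×196 attention matrix; 448×448 already gives 784 tokens and 614,656 entries per head.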
Source
CS231n 2025 Lec 8 slides 100–109 (ViT patchification, linear projection = strided conv, positional encoding, no masking, average pool + classifier). Dosovitskiy et al. ICLR 2021.
Video ViTs (CS231n 2025 Lec 10)
ViTs extend to video by tokenizing spatio-temporal patches. Three families:
- Factorized attention: split spatial and temporal self-attention into separate blocks instead of one expensive joint attention. ViViT (Arnab et al ICCV 2021), TimeSformer (Bertasius et al ICML 2021), Video Transformer Network (Neimark et al ICCV 2021).
- Pooling modules: progressively pool tokens across stages, like a CNN feature pyramid. MViTv1 (Fan et al ICCV 2021), MViTv2 (Li et al CVPR 2022).
- Video masked autoencoders: pretrain by masking a high fraction of spatio-temporal patches and reconstructing — works very well at scale. VideoMAE (Tong et al NeurIPS 2022), VideoMAE V2 (Wang et al CVPR 2023), Feichtenhofer NeurIPS 2022.
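The factorized-attention idea from ViViT/TimeSformer above can be sketched in numpy: attend over space within each frame, then over time at each spatial location. This is a simplifying sketch — single head, no learned Q/K/V projections, and the tiny shapes (T=4 frames, N=9 patches) are arbitrary.

```python
import numpy as np

def attention(x):
    # x: (..., n, d) -> softmax(x @ x.T / sqrt(d)) @ x, batched over leading dims
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
T, N, D = 4, 9, 16  # frames, patches per frame, embed dim (toy sizes)
tokens = rng.standard_normal((T, N, D))

# Factorized: spatial attention within each frame, then temporal
# attention at each spatial location (swap axes so time is the sequence dim).
spatial = attention(tokens)
temporal = attention(spatial.swapaxes(0, 1)).swapaxes(0, 1)

# Why factorize: joint attention scores all T*N tokens against each other,
# factorized attention only scores within-frame and within-location pairs.
joint_pairs = (T * N) ** 2                 # 36^2 = 1296
factorized_pairs = T * N * N + N * T * T   # 324 + 144 = 468
print(temporal.shape, joint_pairs, factorized_pairs)
```

Even at these toy sizes the factorized version computes about a third of the pairwise scores; the gap widens quickly as T and N grow, which is the whole point of the ViViT/TimeSformer designs.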
Kinetics-400 top-1: I3D 71.1 → SlowFast+NL 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90. The image-classification recipe (better backbones, masked-image pretraining) transfers cleanly to video.
Source
CS231n 2025 Lec 10 slides 75–78 (Video ViT taxonomy, Kinetics-400 final accuracy chart). 2026 PDF not published — using 2025 fallback.