Vision Transformer (ViT)
At its core it’s the same as a standard Transformer. You split the image into 16×16-pixel patches, flatten each patch, and encode it into a vector with a learned linear projection. A positional embedding is added to each patch embedding, and the original paper also prepends one extra learnable [CLS] token for classification.
Then you make classification predictions from the head (variants below):
How does each variant differ from the original ViT paper?
Most of the backbone logic of patchifying and projecting the image is the same; the differences are in the heads and in how the losses work.
A CNN trains well on less data because locality is built into the architecture: the model doesn’t need to learn how to focus, only what to focus on. A Transformer has to learn how to focus (its attention patterns) from the data itself.
See An Image is Worth 16x16 Words for notes.
Architecture (CS231n 2025 Lec 8)
Take the standard Transformer block (designed for sets of vectors) and feed it image patches:
- Patchify: split a 224×224×3 image into 196 non-overlapping 16×16×3 patches.
- Flatten + linear project: each flattened 768-dim patch (16·16·3) goes through a learned linear projection to the embedding dimension D. Equivalently, this is a 16×16 conv with stride 16, 3 input channels, and D output channels.
- Add positional encoding to each patch embedding so the Transformer knows the 2D position. Self-attention is permutation-equivariant on its own, so without PE the model would be blind to image layout.
- Run through Transformer blocks with no masking: every patch attends to every other patch (unlike a language model, which masks future tokens).
- Pool + classify: average-pool the output tokens into a single vector, then a linear layer produces the class scores.
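The patchify + linear-projection step above, and its equivalence to a strided conv, can be checked in a few lines of numpy. This is a minimal sketch: the image is shrunk to 32×32 (a 2×2 grid of patches) and the embedding dim D=8 is arbitrary, just to keep the demo small.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32          # toy image size: a 2x2 grid of 16x16 patches
P, C, D = 16, 3, 8  # patch size, input channels, embedding dim (D=8 is arbitrary)

img = rng.standard_normal((H, W, C))
Wproj = rng.standard_normal((P * P * C, D))  # the "learned" linear projection

# Path 1: patchify -> flatten -> one big matmul
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * C) @ Wproj  # (num_patches, D)

# Path 2: a 16x16 "conv" with stride 16 — same matmul applied per window
conv_out = np.empty_like(tokens)
idx = 0
for i in range(0, H, P):
    for j in range(0, W, P):
        window = img[i:i + P, j:j + P, :].reshape(-1)
        conv_out[idx] = window @ Wproj
        idx += 1

assert np.allclose(tokens, conv_out)  # linear projection == strided conv
print(tokens.shape)  # (4, 8): 4 patch tokens, each embedded to D=8 dims
```

For the real 224×224 input this gives 196 tokens of dimension D, which is exactly the sequence the Transformer blocks consume.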
Why bother (vs a CNN)? After one attention layer, each output patch depends directly on every input patch; there’s no need to stack many conv layers to grow the receptive field. The trade-off is the attention cost, which is quadratic in the number of patches.
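To make that trade-off concrete, here is the token count and attention-matrix size as the image grows (patch size 16 assumed, as in the architecture sketch above):

```python
# Doubling the image side quadruples the token count and grows the
# attention matrix 16x, since every token attends to every other token.
def num_tokens(side, patch=16):
    return (side // patch) ** 2

for side in (224, 448):
    n = num_tokens(side)
    # n^2 attention entries per head per layer
    print(f"{side}x{side}: {n} tokens, {n * n} attention entries")
```

So 224×224 gives 196 tokens and a 196×196 attention matrix; 448×448 already gives 784 tokens and 614,656 entries per head.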
Source
CS231n 2025 Lec 8 slides 100–109 (ViT patchification, linear projection = strided conv, positional encoding, no masking, average pool + classifier). Dosovitskiy et al. ICLR 2021.
Video ViTs (CS231n 2025 Lec 10)
ViTs extend to video by tokenizing spatio-temporal patches. Three families:
- Factorized attention: split spatial and temporal self-attention into separate blocks instead of one expensive joint attention. ViViT (Arnab et al ICCV 2021), TimeSformer (Bertasius et al ICML 2021), Video Transformer Network (Neimark et al ICCV 2021).
- Pooling modules: progressively pool tokens across stages, like a CNN feature pyramid. MViTv1 (Fan et al ICCV 2021), MViTv2 (Li et al CVPR 2022).
- Video masked autoencoders: pretrain by masking a high fraction of spatio-temporal patches and reconstructing — works very well at scale. VideoMAE (Tong et al NeurIPS 2022), VideoMAE V2 (Wang et al CVPR 2023), Feichtenhofer NeurIPS 2022.
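The factorized-attention idea from ViViT/TimeSformer above can be sketched in numpy: attend over space within each frame, then over time at each spatial location. This is a simplifying sketch — single head, no learned Q/K/V projections, and the tiny shapes (T=4 frames, N=9 patches) are arbitrary.

```python
import numpy as np

def attention(x):
    # x: (..., n, d) -> softmax(x @ x.T / sqrt(d)) @ x, batched over leading dims
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
T, N, D = 4, 9, 16  # frames, patches per frame, embed dim (toy sizes)
tokens = rng.standard_normal((T, N, D))

# Factorized: spatial attention within each frame, then temporal
# attention at each spatial location (swap axes so time is the sequence dim).
spatial = attention(tokens)
temporal = attention(spatial.swapaxes(0, 1)).swapaxes(0, 1)

# Why factorize: joint attention scores all T*N tokens against each other,
# factorized attention only scores within-frame and within-location pairs.
joint_pairs = (T * N) ** 2                 # 36^2 = 1296
factorized_pairs = T * N * N + N * T * T   # 324 + 144 = 468
print(temporal.shape, joint_pairs, factorized_pairs)
```

Even at these toy sizes the factorized version computes about a third of the pairwise scores; the gap widens quickly as T and N grow, which is the whole point of the ViViT/TimeSformer designs.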
Kinetics-400 top-1: I3D 71.1 → SlowFast+NL 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90. The image-classification recipe (better backbones, masked-image pretraining) transfers cleanly to video.
Source
CS231n 2025 Lec 10 slides 75–78 (Video ViT taxonomy, Kinetics-400 final accuracy chart). 2026 PDF not published — using 2025 fallback.