Vision Transformer (ViT)

Well, it’s basically the same as a normal Transformer. You split the image into 16x16 pixel patches, flatten each patch, and encode it into a vector with a shared linear projection. A positional embedding is added to every patch embedding, and one extra learnable token (the [CLS] token) is prepended to the sequence.

And then you make classification predictions from the encoder output at that [CLS] token, using a small classification head.
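A minimal sketch of that pipeline in PyTorch. All the sizes here (`embed_dim=192`, `depth=4`, etc.) are made-up toy hyperparameters, not the ones from the ViT paper; the point is just the flow: patchify, linearly project, prepend a learnable [CLS] token, add learned positional embeddings, run a standard Transformer encoder, classify from the [CLS] output.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2       # 14 * 14 = 196
        patch_dim = in_chans * patch_size * patch_size      # 3 * 16 * 16 = 768

        # Linear projection of flattened patches -> patch embeddings
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)

        # Extra learnable [CLS] token + learned positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard Transformer encoder
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

        # Classification head on the [CLS] output
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)                # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                                     # (B, N, embed_dim)

        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))              # shape (2, 10)
```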

A CNN can train on less data because its inductive biases (locality, translation equivariance, weight sharing) are built in: the model doesn’t need to learn how to focus, only what to focus on. A Transformer has to learn how to focus from the data, which is why ViTs usually need large-scale pre-training to match CNNs.

Some popular models that use ViTs: CLIP and DINO use ViT image encoders, and PaliGemma (discussed below) uses the SigLIP ViT as its vision tower.

Why is the linear projection layer needed? Why not just pass in the raw flattened image patch?

🧠 Summary ✅ Yes, it’s just a linear layer — but:

  • It mixes the pixels and channels of each patch into a useful feature representation
  • Acts like a one-layer perceptron applied to each flattened patch
  • Projects every patch to the model dimension, producing the embeddings that the positional encodings are then added to
  • It’s equivalent to a convolution with kernel size and stride equal to the patch size, with the same projection weights shared across all patch positions (see the sketch below)
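To make the last point concrete, here is a small sketch (hypothetical sizes) showing that the patch projection can be written either as a shared `nn.Linear` over flattened patches or as a single `nn.Conv2d` with kernel size and stride equal to the patch size; with the weights copied over, both paths give the same output.

```python
import torch
import torch.nn as nn

p, in_chans, embed_dim = 16, 3, 192
linear = nn.Linear(in_chans * p * p, embed_dim)
conv = nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p)

# Copy the linear weights into the conv kernel so both compute the same projection.
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(embed_dim, in_chans, p, p))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, in_chans, 224, 224)

# Path A: flatten each 16x16 patch, then apply the shared linear projection.
patches = x.unfold(2, p, p).unfold(3, p, p)                   # (1, C, 14, 14, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, in_chans * p * p)
out_linear = linear(patches)                                  # (1, 196, embed_dim)

# Path B: a single convolution with kernel size = stride = patch size.
out_conv = conv(x).flatten(2).transpose(1, 2)                 # (1, 196, embed_dim)

print(torch.allclose(out_linear, out_conv, atol=1e-4))        # True
```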

Each patch is a token, so the smaller the patch, the longer the sequence: a 224x224 image cut into 16x16 patches gives 14x14 = 196 patch tokens (plus the [CLS] token).

  • This is particularly important for PaliGemma, where the image tokens are fed to the language model alongside the text tokens, so patch size and image resolution directly determine how many tokens the model has to process (see the token-count sketch after this list).
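A quick back-of-the-envelope check of how patch size drives sequence length (plain arithmetic; the 224x224 image size is just an example):

```python
# Number of patch tokens for a square image with non-overlapping square patches.
def num_patches(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for p in (32, 16, 14, 8):
    print(f"224x224 image, {p}x{p} patches -> {num_patches(224, p)} tokens")
# 32x32 -> 49, 16x16 -> 196, 14x14 -> 256, 8x8 -> 784 (plus one [CLS] token)
```

Halving the patch size roughly quadruples the token count, and self-attention cost grows quadratically with sequence length, so small patches (or high resolutions) get expensive quickly.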
