Vision Transformer (ViT)

Well it’s basically the same as a standard transformer encoder. You split the image into 16x16 pixel patches, flatten each patch, and project it into an embedding vector with a shared linear layer. A learnable [CLS] token is prepended to the sequence, and positional embeddings are added to all tokens.
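A minimal sketch of that input pipeline in PyTorch, assuming a 224x224 RGB image, 16x16 patches, and an embedding dimension of 768 (the names `PatchEmbed`, `embed_dim`, etc. are just illustrative, not from any particular library):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify, linearly project, prepend [CLS], add positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-16 conv is equivalent to cutting 16x16 patches and
        # applying one shared linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) -- one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # (B, 197, 768)
        return x + self.pos_embed            # add positional information
```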

And then you make classification predictions from the [CLS] token (variants below):

What is different in each of these variants compared to the original ViT paper?

Most of the backbone logic for projecting and encoding the image is the same. The differences are in the heads and in how the losses work.
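A rough sketch of the kind of head that sits on top of the encoder output; the pooling choice and `num_classes` here are illustrative assumptions, and variants typically swap out exactly this part and the loss (e.g. pooling strategy, extra distillation targets) rather than the backbone:

```python
import torch
import torch.nn as nn

class ViTHead(nn.Module):
    """Classification head on top of the encoder's token outputs."""
    def __init__(self, embed_dim=768, num_classes=1000, pool="cls"):
        super().__init__()
        self.pool = pool                     # "cls" (original ViT) or "mean"
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):               # tokens: (B, 197, 768) from the encoder
        x = tokens[:, 0] if self.pool == "cls" else tokens.mean(dim=1)
        return self.fc(self.norm(x))          # class logits

# Typical usage: cross-entropy on the logits.
# logits = ViTHead()(encoder_output)
# loss = nn.CrossEntropyLoss()(logits, labels)
```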

A CNN can train on less data because it doesn’t need to learn how to focus, only what to focus on: the convolutional inductive biases (locality, translation equivariance) are built in. A transformer has to “learn” how to focus from the data itself.

See An Image is Worth 16x16 Words for notes.