Vision Transformer (ViT)
TO UNDERSTAND.
Well it’s basically the same as a normal transformer. You split the image into 16x16 pixel patches, flatten each patch, and project it into an embedding vector with a linear projection. You add positional embeddings to the patch embeddings and prepend one extra learnable [class] token.
The output of that [class] token after the encoder is fed to a small head to make classification predictions.
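To make the mechanics concrete, here is a minimal PyTorch sketch of that front end (the name `MinimalViT` and the exact layer choices are my own; the hyperparameters roughly match the ViT-Base defaults). It's an illustration of the idea, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Sketch of ViT: patchify, linearly project, prepend a learnable
    [class] token, add positional embeddings, run a standard Transformer
    encoder, and classify from the [class] token's output."""

    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14*14 = 196 for 224/16

        # Linear projection of flattened 16x16 patches, done as a strided conv
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # One extra learnable token whose final state is used for classification
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional embeddings for the [class] token + all patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [class] token
        x = x + self.pos_embed                 # add positional information
        x = self.encoder(x)                    # plain Transformer encoder
        return self.head(x[:, 0])              # classify from [class] token


logits = MinimalViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```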
A CNN trains well on less data because locality and translation equivariance are baked in as inductive biases: the model doesn’t need to learn how to focus, only what to focus on. A transformer has to learn those spatial relations from the data, which is why ViT needs large-scale pre-training to match CNNs.
- Paper: “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al., 2020)