Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)
Invariance-based methods = apply hand-crafted augmentations to the original image, and train the encoder so that the embeddings of the augmented views are very similar (invariant to the augmentations)
- These methods are strongly biased towards the specific augmentations applied, since the augmentations are hand-crafted; how well the learned representations generalize beyond them is unclear (see the sketch after this bullet)
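A minimal sketch of the invariance objective, assuming a toy `encoder` and two placeholder augmentations (real methods like SimCLR/BYOL use a deep network, richer augmentations, and extra machinery to prevent collapse):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: real methods use a deep encoder and hand-crafted
# augmentations (random crops, color jitter, flips, ...).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
augment_a = lambda x: x + 0.1 * torch.randn_like(x)  # additive noise
augment_b = lambda x: torch.flip(x, dims=[-1])       # horizontal flip

def invariance_loss(image):
    z_a = encoder(augment_a(image))  # embedding of view A
    z_b = encoder(augment_b(image))  # embedding of view B
    # Pull the two views' embeddings together, so the encoder becomes
    # invariant to exactly the augmentations we picked -- hence the bias.
    # (Alone this collapses to a constant; real methods add contrastive
    # negatives or stop-gradient tricks, omitted here.)
    return -F.cosine_similarity(z_a, z_b, dim=-1).mean()

loss = invariance_loss(torch.randn(8, 3, 32, 32))
```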
Generative methods = reconstruct the original image from a corrupted version of it
- I was confused about this, since I thought that for a generative architecture we only needed “y”, not “x”?
- The paper explains it: x is a compatible (e.g. masked) copy of the signal y, and a decoder reconstructs y from x, so both appear in the objective
- The Masked Autoencoder (MAE) paper is an example of this (see the sketch after this list)
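For contrast, a rough MAE-style sketch where the loss lives in input (pixel) space; the masking here is a crude element-wise mask rather than MAE's real patch dropping, and all module shapes are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 3 * 32 * 32
encoder = nn.Sequential(nn.Flatten(), nn.Linear(D, 128))
decoder = nn.Linear(128, D)  # maps embeddings back to pixel space

def mae_style_loss(y):
    # x is the "compatible signal": a corrupted (masked) copy of y.
    visible = (torch.rand_like(y) > 0.75).float()  # keep ~25% of pixels
    x = y * visible
    y_hat = decoder(encoder(x)).view_as(y)
    # Reconstruction loss in INPUT space: predicted pixels vs. original.
    return F.mse_loss(y_hat, y)

loss = mae_style_loss(torch.randn(8, 3, 32, 32))
```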
Very similar to MAE, except that instead of computing the loss in the input space (image pixels), it is computed in the embedding space: a predictor takes the context encoder's output and predicts the target encoder's embeddings of the masked regions, as sketched below.
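Same skeleton as the MAE sketch, but with the I-JEPA twist: the prediction target is the output of a separate target encoder (an EMA copy of the context encoder in the paper), so the loss is computed between embeddings, never pixels. The real method masks at the patch-token level and predicts per-block representations via positional mask tokens; this toy version collapses that to one vector per image just to show where the loss sits:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 3 * 32 * 32
context_encoder = nn.Sequential(nn.Flatten(), nn.Linear(D, 128))
predictor = nn.Linear(128, 128)                  # predicts target embeddings
target_encoder = copy.deepcopy(context_encoder)  # EMA-updated in the paper
for p in target_encoder.parameters():
    p.requires_grad_(False)  # no gradients flow into the target encoder

def ijepa_style_loss(y):
    visible = (torch.rand_like(y) > 0.75).float()
    x = y * visible                       # context: masked copy of the image
    pred = predictor(context_encoder(x))  # predict from the context
    with torch.no_grad():
        target = target_encoder(y)        # embedding of the full signal
    # Loss in EMBEDDING space: no pixel-level reconstruction at all.
    return F.mse_loss(pred, target)

loss = ijepa_style_loss(torch.randn(8, 3, 32, 32))
```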