Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)

Invariance-based methods = apply several hand-crafted augmentations to the original image and train the encoder so that the embeddings generated from all the augmented views are very similar (invariant to the augmentations)

  • These methods are strongly biased toward the specific augmentations applied, since the augmentations are hand-crafted. How well the learned representations generalize beyond those augmentations is unclear (see the sketch below)
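A minimal sketch of the invariance-based objective, assuming a hypothetical `encoder` module; the augmentation pipeline and the negative-cosine loss here are illustrative, not any specific paper's recipe:

```python
import torch.nn.functional as F
from torchvision import transforms

# Hand-crafted augmentations: this choice is exactly the bias noted above.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

def invariance_loss(encoder, image):
    # Two independent random views of the same image.
    z1 = encoder(augment(image))
    z2 = encoder(augment(image))
    # Pull the two embeddings together: minimized when they match,
    # i.e. when the encoder is invariant to the augmentations.
    return -F.cosine_similarity(z1, z2, dim=-1).mean()
```

On its own this objective admits a collapsed solution (map every image to the same embedding); real invariance-based methods add contrastive negatives (SimCLR) or asymmetry like a stop-gradient/momentum branch (BYOL, SimSiam) to prevent that.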

Generative methods = reconstruct the original image (in pixel space) from a corrupted or masked version of it

  • I was confused about this, since I thought that for a generative architecture we only needed “y”, not “x”? → In the paper’s framing, x is the corrupted/masked input and y is the original signal, so both are needed: the model reconstructs y from x (see the sketch below)
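A minimal sketch of a generative (MAE-style) objective, assuming a hypothetical `masked_autoencoder` module; note that both x (the corrupted input) and y (the original image) appear:

```python
import torch.nn.functional as F

def pixel_reconstruction_loss(masked_autoencoder, image, mask):
    # mask: boolean tensor marking the pixels/patches that were hidden.
    x = image * (~mask)                     # x = corrupted input the model sees
    reconstruction = masked_autoencoder(x)  # predicted pixels
    # The loss lives in INPUT space: MSE between predicted and true pixels,
    # evaluated on the hidden region (y = the original image).
    return F.mse_loss(reconstruction[mask], image[mask])
```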

I-JEPA is very similar to MAE, except that instead of computing the loss in the input space (image pixels), it computes the loss in the embedding space.

How does one apply a loss in embedding space?
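A minimal sketch of the answer, following the I-JEPA recipe: encode the target block with a separate target encoder, predict those embeddings from the context, and regress predicted embeddings onto target embeddings. `context_encoder`, `target_encoder`, and `predictor` are hypothetical module names; the paper uses an average L2 loss over the predicted patch representations:

```python
import torch
import torch.nn.functional as F

def embedding_space_loss(context_encoder, target_encoder, predictor,
                         context_patches, target_patches):
    # Targets are embeddings, not pixels. Stop gradients through the
    # target branch: the target encoder is never trained directly.
    with torch.no_grad():
        target_emb = target_encoder(target_patches)

    # Encode the visible context and predict the target-block embeddings.
    context_emb = context_encoder(context_patches)
    predicted_emb = predictor(context_emb)

    # L2 loss in embedding space, averaged over patches.
    return F.mse_loss(predicted_emb, target_emb)
```

In I-JEPA the target encoder’s weights are an exponential moving average (EMA) of the context encoder’s weights; the EMA plus the stop-gradient is what keeps the embedding-space objective from collapsing to a trivial constant representation.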