Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL)
An important paper that V-JEPA cites for preventing representation collapse.
$$\mathcal{L} = \big\| q_\theta(z_1) - \mathrm{sg}(z_2') \big\|_2^2 + \big\| q_\theta(z_2) - \mathrm{sg}(z_1') \big\|_2^2$$

where:
- $z_1, z_2$ are representations from the online encoder given two augmentations of the same image,
- $z_1', z_2'$ are representations from the target encoder (parameters updated via EMA),
- $q_\theta$ is the predictor head on the online branch,
- $\mathrm{sg}(\cdot)$ means stop-gradient (no gradients flow into the target encoder),
- $\|\cdot\|_2^2$ is the squared distance (MSE).
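A minimal PyTorch sketch of this loss (the function name and argument names are illustrative; note the paper applies the MSE to L2-normalized vectors, which is equivalent to a cosine-similarity loss up to a constant):

```python
import torch.nn.functional as F

def byol_loss(q1, q2, z1_t, z2_t):
    """Symmetrized BYOL loss.

    q1, q2     -- predictor outputs of the online branch for views 1 and 2
    z1_t, z2_t -- target-encoder representations for views 1 and 2
    """
    # L2-normalize, as in the paper; MSE on unit vectors = 2 - 2*cosine
    q1, q2 = F.normalize(q1, dim=-1), F.normalize(q2, dim=-1)
    z1_t, z2_t = F.normalize(z1_t, dim=-1), F.normalize(z2_t, dim=-1)

    # .detach() is the stop-gradient: no gradients reach the target encoder
    loss = ((q1 - z2_t.detach()) ** 2).sum(dim=-1).mean()
    loss = loss + ((q2 - z1_t.detach()) ** 2).sum(dim=-1).mean()
    return loss
```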
Intuition
Each online view predicts the target's representation of the other view. No negatives are involved; collapse is prevented by the combination of stop-gradient, the predictor head, and the momentum (EMA) target.
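A sketch of the momentum target update ($\tau = 0.996$ is the paper's base decay rate; `online` and `target` are assumed to be architecturally identical modules):

```python
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.996):
    """Momentum update: target parameters slowly track the online parameters."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        # p_target <- tau * p_target + (1 - tau) * p_online
        p_target.mul_(tau).add_(p_online, alpha=1 - tau)
```

Because the target is a slow-moving average rather than a gradient-trained network, it provides stable regression targets that the online branch cannot trivially match by collapsing to a constant.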