Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)

No pre-trained weights are used; the model is trained from scratch. Why not lean on the power of pre-training? Because the point here is to show that we need a better architecture than auto-regressive models for representation learning.

Pre-trained on 2 million videos (referred to as VideoMix2M), which combines:

  • HowTo100M (HT) (Miech et al., 2019),
  • Kinetics-400/600/700 (K710) (Kay et al., 2017),
  • Something-Something-v2 (SSv2) (Goyal et al., 2017).

Videos overlapping the validation sets of Kinetics-400/600/700 and Something-Something-v2 are removed.

Loss Objective

They point out that representation collapse (a trivial constant solution) occurs if we just use the naive regression objective (sketched in code after this list):

$$\min_{\theta,\phi}\;\big\lVert P_\phi\big(E_\theta(x),\,\Delta_y\big) - E_\theta(y)\big\rVert_1$$

where:

  • $E_\theta$ is the encoder network
  • $P_\phi$ is the predictor network
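
For concreteness, here is a minimal PyTorch sketch of why the naive objective collapses. The `encoder` and `predictor` modules, tensor shapes, and the `delta_y` argument are placeholders of my own, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

def naive_feature_prediction_loss(
    encoder: nn.Module,     # placeholder for E_theta
    predictor: nn.Module,   # placeholder for P_phi
    x: torch.Tensor,        # visible (unmasked) patches of the clip
    y: torch.Tensor,        # masked patches whose features are the targets
    delta_y: torch.Tensor,  # positions of the masked patches (mask tokens)
) -> torch.Tensor:
    # \hat{s}_y = P_phi(E_theta(x), Delta_y)
    pred = predictor(encoder(x), delta_y)
    # s_y = E_theta(y): same encoder, and gradients flow through BOTH
    # branches, so an encoder that outputs a constant trivially
    # minimizes the loss -> representation collapse
    target = encoder(y)
    # average L1 distance between predicted and target features
    return (pred - target).abs().mean()
```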

So they use a stop-gradient operation together with an EMA target encoder (sketched in code after this list):

$$\min_{\theta,\phi}\;\big\lVert P_\phi\big(E_\theta(x),\,\Delta_y\big) - \operatorname{sg}\!\big(\bar{E}(y)\big)\big\rVert_1$$

where:

  • $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation
  • $\bar{E}$ is an EMA of the x-encoder’s weights
  • $\Delta_y$ is a learnable mask token that indicates the locations of the dropped patches
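
A sketch of the collapse-avoiding version, assuming a hypothetical `target_encoder` kept as an EMA copy of the x-encoder; the fixed momentum value is illustrative only (the paper schedules it over training):

```python
import copy
import torch
import torch.nn as nn

def make_target_encoder(encoder: nn.Module) -> nn.Module:
    # \bar{E} starts as a frozen copy of the x-encoder E_theta
    target = copy.deepcopy(encoder)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def ema_update(target: nn.Module, encoder: nn.Module, momentum: float = 0.999):
    # \bar{E} <- momentum * \bar{E} + (1 - momentum) * E_theta
    for p_t, p in zip(target.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1.0 - momentum)

def jepa_loss(encoder, predictor, target_encoder, x, y, delta_y):
    pred = predictor(encoder(x), delta_y)  # P_phi(E_theta(x), Delta_y)
    with torch.no_grad():                  # stop-gradient: sg(...)
        target = target_encoder(y)         # EMA target \bar{E}(y)
    # gradients only reach E_theta and P_phi, never the target branch,
    # so the constant-output shortcut is no longer a minimizer
    return (pred - target).abs().mean()
```

Because the target branch never receives gradients and only tracks the x-encoder through the slow EMA update, the online encoder cannot collapse both branches onto a constant at once.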

Video benchmarks

Image benchmarks