Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)
No use of pretrained image encoders, text, or other external supervision. Why don't the authors believe in the power of pre-training? The point is more to show that we need a better architecture/objective for representation learning than auto-regressive models.
Pre-trained on 2 million videos (referred to as VideoMix2M), which combines:
- HowTo100M (HT) (Miech et al., 2019),
- Kinetics-400/600/700 (K710) (Kay et al., 2017),
- Something-Something-v2 (SSv2) (Goyal et al., 2017),
with any overlap with the validation sets of Kinetics-400/600/700 and Something-Something-v2 removed (a small sketch of the mix follows).
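Purely as a reading aid, a hypothetical sketch of assembling the mix; the loader function, dataset keys, and `video_id`-based filtering are my assumptions, not details from the paper or its codebase.

```python
# Hypothetical sketch: pool the three sources into one pre-training set
# and drop anything that appears in the evaluation validation sets.
SOURCES = ["howto100m", "kinetics_710", "ssv2"]  # dataset keys are placeholders

def build_videomix2m(load_videos, validation_ids):
    """`load_videos(key)` and the per-video `video_id` field are assumed helpers."""
    pool = []
    for key in SOURCES:
        pool.extend(load_videos(key))
    # Remove overlap with the K400/600/700 and SSv2 validation sets.
    return [v for v in pool if v["video_id"] not in validation_ids]
```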
Loss Objective
They talk about representation collapse if we just use the naive regression objective (sketched in code after the bullets):

$$\min_{\theta,\phi} \; \big\| P_\phi\big(E_\theta(x), \Delta_y\big) - E_\theta(y) \big\|_1$$

- $E_\theta$ is the encoder network
- $P_\phi$ is the predictor network
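A minimal PyTorch-style sketch (not the paper's code) of the naive objective; `encoder`, `predictor`, `x` (visible patches), `y` (dropped patches), and `mask_tokens` are placeholder names. Because the target features come from the same trainable encoder as the prediction, an encoder that outputs a constant drives the loss to zero, which is the collapse the authors point out.

```python
import torch.nn.functional as F

def naive_feature_prediction_loss(encoder, predictor, x, y, mask_tokens):
    """Naive L1 regression between predicted and target features.

    Both terms depend on the same trainable encoder, so the loss admits a
    trivial solution where the encoder maps every input to the same constant.
    """
    pred = predictor(encoder(x), mask_tokens)  # P_phi(E_theta(x), Δy)
    target = encoder(y)                        # E_theta(y): gradients flow here too
    return F.l1_loss(pred, target)
```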
So they use a stop-gradient operation on the target, which is computed by an EMA of the x-encoder (sketch below):

$$\min_{\theta,\phi} \; \big\| P_\phi\big(E_\theta(x), \Delta_y\big) - \mathrm{sg}\big(\bar{E}_\theta(y)\big) \big\|_1$$

- $\bar{E}_\theta$ is an EMA of the x-encoder's weights
- $\Delta_y$ is a learnable mask token that indicates the locations of the dropped patches
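A hedged sketch of the anti-collapse version under the same placeholder names: the target comes from an EMA copy of the x-encoder under a stop-gradient (here `torch.no_grad()`), and the EMA weights are updated outside the backward pass. The momentum value is illustrative, not the paper's schedule.

```python
import copy
import torch
import torch.nn.functional as F

def make_ema_encoder(encoder):
    """Target (y-)encoder: a frozen copy of the x-encoder."""
    ema = copy.deepcopy(encoder)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def ema_update(ema_encoder, encoder, momentum=0.999):  # momentum value is illustrative
    """EMA of the x-encoder's weights: theta_bar <- m * theta_bar + (1 - m) * theta."""
    for p_ema, p in zip(ema_encoder.parameters(), encoder.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

def vjepa_style_loss(encoder, ema_encoder, predictor, x, y, mask_tokens):
    """L1 loss against stop-gradient targets from the EMA encoder."""
    pred = predictor(encoder(x), mask_tokens)  # P_phi(E_theta(x), Δy)
    with torch.no_grad():                      # sg(·): no gradient to the target branch
        target = ema_encoder(y)                # E_bar(y)
    return F.l1_loss(pred, target)
```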
Video benchmarks
Image benchmarks
