Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)

No use of pretrained image encoders. Why don’t you believe in the power of pre-training? Is the goal to show scaling laws?

Video benchmarks

Image benchmarks