CV · AI · LG · Feb 15, 2024

Revisiting Feature Prediction for Learning Visual Representations from Video

arXiv:2404.08471v1 · 317 citations · h-index: 25 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses unsupervised learning of versatile visual representations from video, a problem of interest to computer vision researchers. The approach is novel in its framing but contains incremental elements.

The paper tackles unsupervised visual representation learning from video using a feature prediction objective. With the backbone frozen (no adaptation of the model's parameters), it performs strongly on downstream tasks: the largest model reaches 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
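To make the idea of a feature prediction objective concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The linear-plus-tanh "encoders", the mean-pooling predictor, the L1 distance, and every function and variable name are assumptions chosen for illustration. The only point it shows is that the loss is computed between predicted and target features of masked video tokens, rather than between reconstructed and original pixels.

```python
import numpy as np


def encode(tokens, weight):
    """Toy encoder (hypothetical stand-in for a vision transformer)."""
    return np.tanh(tokens @ weight)


def feature_prediction_loss(tokens, mask, w_context, w_target, w_predictor):
    """Mean L1 distance between predicted and target features of masked tokens.

    tokens: (num_tokens, dim) array of patch embeddings for one clip.
    mask:   (num_tokens,) boolean array, True where a token is hidden.
    """
    # Target features are computed from the full clip and treated as constants.
    targets = encode(tokens, w_target)[mask]
    # The context encoder only sees the visible (unmasked) tokens.
    context = encode(tokens[~mask], w_context)
    # Toy predictor: pool the context and map it to a single feature vector,
    # which is compared against every masked position. A realistic predictor
    # would also condition on the position of each masked token.
    prediction = context.mean(axis=0) @ w_predictor
    return float(np.abs(targets - prediction).mean())


rng = np.random.default_rng(0)
dim, num_tokens = 8, 16
tokens = rng.normal(size=(num_tokens, dim))
mask = np.zeros(num_tokens, dtype=bool)
mask[4:12] = True  # hide a contiguous block of tokens

w_context = 0.1 * rng.normal(size=(dim, dim))
w_target = 0.1 * rng.normal(size=(dim, dim))
w_predictor = 0.1 * rng.normal(size=(dim, dim))

loss = feature_prediction_loss(tokens, mask, w_context, w_target, w_predictor)
print(f"toy feature prediction loss: {loss:.4f}")
```

In an actual training loop the context encoder and predictor would be updated to reduce this loss, and some mechanism would be needed to keep the target features from collapsing to a constant. The sketch leaves both out.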

Code Implementations: 1 repo
