CV AI LG IV | Jan 19, 2023

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas

arXiv:2301.08243v3 | 51.31040 citations | h-index: 137 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of reducing reliance on manual data augmentations in computer vision. The advance is incremental, since it builds on existing self-supervised and transformer-based approaches.

The paper tackles learning semantic image representations without hand-crafted augmentations by introducing I-JEPA, a non-generative self-supervised method that predicts target block representations from a context block, achieving strong downstream performance on tasks like linear classification and object counting.

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
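The masking strategy described in the abstract can be illustrated with a minimal sketch: sample several fairly large target blocks on the patch grid, then sample one large context block and remove any patches that overlap the targets. The function names (`sample_block`, `sample_masks`) are hypothetical, and the grid size, number of targets, scale ranges, and aspect-ratio ranges are illustrative assumptions; the abstract states none of these values. The sketch covers only mask sampling, not the encoders or the predictor.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patch indices on a grid x grid patch grid."""
    scale = rng.uniform(*scale_range)      # fraction of all patches the block should cover
    aspect = rng.uniform(*aspect_range)    # height / width ratio of the block
    area = scale * grid * grid
    # Derive integer height and width, clamped so the block fits on the grid.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4,
                 target_scale=(0.15, 0.2), target_aspect=(0.75, 1.5),
                 context_scale=(0.85, 1.0), seed=None):
    """Return (context, targets) as sets of flat patch indices.

    All default values are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    # (a) Target blocks with a sufficiently large scale.
    targets = [sample_block(grid, target_scale, target_aspect, rng)
               for _ in range(num_targets)]
    # (b) One large, spatially distributed context block (square here).
    context = sample_block(grid, context_scale, (1.0, 1.0), rng)
    # Remove target patches so the context never reveals what must be predicted.
    for t in targets:
        context -= t
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks(seed=0)
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In a full pipeline, the context indices would select the patches fed to the context encoder, and the target indices would select the representations the predictor has to match. Removing the overlap is what keeps the prediction task non-trivial.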
