CV | AI | LG | IV | Jan 19, 2023

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

arXiv:2301.08243v3 | 880 citations | h-index: 137
Originality: Incremental advance
AI Analysis

This work addresses the reliance on manual data augmentations in self-supervised computer vision. The advance is incremental, since it builds on existing self-supervised and transformer-based approaches.

The paper tackles learning semantic image representations without hand-crafted augmentations by introducing I-JEPA, a non-generative self-supervised method that predicts target block representations from a context block, achieving strong downstream performance on tasks like linear classification and object counting.

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
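The masking strategy described in the abstract (several large target blocks plus one spatially distributed context block) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `sample_block` and `sample_masks`, the 14 x 14 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are all assumptions chosen for the example. Removing target patches from the context block is likewise an assumption, made here so that the prediction task is not trivial.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patch indices on a grid x grid layout.

    Hypothetical helper: scale is the fraction of all patches covered,
    aspect is the height / width ratio of the block.
    """
    scale = rng.uniform(*scale_range)
    aspect = rng.uniform(*aspect_range)
    area = scale * grid * grid
    # Clamp the block to at least one patch and at most the full grid.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context, targets) as sets of flat patch indices.

    All numeric ranges below are illustrative assumptions.
    """
    rng = random.Random(seed)
    # (a) Target blocks with a sufficiently large scale.
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # (b) One large, spatially distributed context block.
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Assumption: drop patches that overlap any target block.
    for t in targets:
        context -= t
    return context, targets


context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```

In a full pipeline, the context patches would be fed to a context encoder, and a predictor would be trained to output the target encoder's representations of each target block. Those components are omitted here.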

Code Implementations: 7 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
