CV · Mar 22, 2023

Correlational Image Modeling for Self-Supervised Visual Pre-Training

arXiv:2303.12670v3 · 12.620 citations · h-index: 128 · Has Code

Originality: Highly original

AI Analysis

This paper addresses the problem of reducing reliance on labeled data in computer vision, offering a novel pre-training approach that is competitive with existing self-supervised methods.

The paper tackles self-supervised visual pre-training by introducing Correlational Image Modeling (CIM), which predicts correlation maps between cropped image regions (exemplars) and the full image (context). It reports performance on par with or better than state-of-the-art methods on self-supervised and transfer benchmarks.

We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter converts the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer values and keys. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
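The third design point, cross-attention with the context as queries and the exemplar as keys and values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The single attention head, the linear read-out `w_out`, the 14×14 and 4×4 token grids, and the random features standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correlation_map(context_tokens, exemplar_tokens, w_q, w_k, w_v, w_out):
    """Single-head cross-attention sketch (hypothetical helper).

    Context tokens act as queries; exemplar tokens supply keys and values.
    Returns one correlation logit per context token, shape (n_context,).
    """
    q = context_tokens @ w_q                                   # (n_ctx, d)
    k = exemplar_tokens @ w_k                                  # (n_ex, d)
    v = exemplar_tokens @ w_v                                  # (n_ex, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)    # (n_ctx, n_ex)
    fused = attn @ v                                           # (n_ctx, d)
    return (fused @ w_out).squeeze(-1)                         # (n_ctx,)

# Random features stand in for encoder outputs (assumption, not real data).
rng = np.random.default_rng(0)
d = 16
context_tokens = rng.normal(size=(14 * 14, d))   # assumed 14x14 context grid
exemplar_tokens = rng.normal(size=(4 * 4, d))    # assumed 4x4 exemplar grid
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
w_out = rng.normal(size=(d, 1)) / np.sqrt(d)     # assumed linear read-out

logits = correlation_map(context_tokens, exemplar_tokens, w_q, w_k, w_v, w_out)
corr = logits.reshape(14, 14)                    # spatial correlation map
```

Because the context supplies the queries, the output has one value per context location, which is what lets it be reshaped into a spatial map over the input image.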

