CV | Oct 1, 2023

Self-supervised Learning of Contextualized Local Visual Embeddings

arXiv:2310.00527v3 | 3 citations | h-index: 15
AI Analysis

This paper addresses the need for better self-supervised learning methods in computer vision, particularly for dense prediction tasks such as object detection. It appears incremental, however, because it builds on existing CNN and attention paradigms.

The paper tackles the problem of learning visual representations for dense prediction tasks by proposing CLoVE, a self-supervised method that uses a normalized multi-head self-attention layer to combine local features. It reports state-of-the-art performance for CNN-based architectures on four dense prediction tasks.

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks, including object detection, instance segmentation, keypoint detection, and dense pose estimation.
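To make the idea of combining local features "based on similarity" concrete, here is a minimal sketch of one way such a layer could work. It is not the authors' implementation. It rests on several assumptions: "normalized" is read as L2-normalizing queries and keys so that attention scores become cosine similarities; the learned query, key, and value projections are replaced by identity maps; each head simply takes a contiguous slice of the channels; and the names `contextualize`, `num_heads`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(local_feats, num_heads=2, temperature=0.1):
    """Mix local embeddings by cosine similarity, one head per channel slice.

    local_feats: list of N vectors of length D, e.g. a CNN feature map
    flattened over its H*W spatial positions.
    Assumption: identity maps stand in for learned Q/K/V projections.
    """
    dim = len(local_feats[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads

    outputs = [[] for _ in local_feats]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        values = [f[lo:hi] for f in local_feats]
        unit = [l2_normalize(v) for v in values]  # queries and keys coincide
        for i, query in enumerate(unit):
            # Cosine similarity between position i and every position j.
            scores = [sum(q * k for q, k in zip(query, key)) / temperature
                      for key in unit]
            weights = softmax(scores)
            # Weighted average of the (unnormalized) values for this head.
            mixed = [sum(w * v[c] for w, v in zip(weights, values))
                     for c in range(head_dim)]
            outputs[i].extend(mixed)
    return outputs


# Three local embeddings: the first two are similar, the third is not.
feats = [[1.0, 0.0, 0.0, 1.0],
         [0.9, 0.1, 0.1, 0.9],
         [0.0, 1.0, 1.0, 0.0]]
ctx = contextualize(feats, num_heads=2)
```

In this toy input, each output embedding is a convex combination of the inputs, so the first two positions mostly average with each other while the third stays close to itself. A real layer would operate on batched tensors and learn its projections.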

Code Implementations: 1 repo
