CV | Oct 1, 2023

Self-supervised Learning of Contextualized Local Visual Embeddings

Thalles Santos Silva, Helio Pedrini, Adín Ramírez Rivera

arXiv:2310.00527v33.93 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the need for better self-supervised learning methods in computer vision, particularly for dense prediction tasks such as object detection. It appears incremental because it builds on existing CNN and attention paradigms.

The paper tackles the problem of learning visual representations for dense prediction tasks by proposing CLoVE, a self-supervised method that uses a normalized multi-head self-attention layer to combine local features. The authors report state-of-the-art performance for CNN-based architectures on four dense prediction downstream tasks.

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks, including object detection, instance segmentation, keypoint detection, and dense pose estimation.
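
As a rough illustration of the idea in the abstract, the sketch below mixes the local features of a CNN output feature map with a similarity-based multi-head attention step. It is not the authors' implementation: the function name `contextualize`, the use of L2-normalized features as both queries and keys (so attention scores are cosine similarities), the absence of learned projections, and the `temperature` value are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextualize(feature_map, num_heads=4, temperature=0.1):
    """Hypothetical helper: mix local embeddings by similarity.

    feature_map: array of shape (H, W, C), e.g. a CNN encoder output.
    Returns contextualized embeddings of shape (H, W, C) and the
    attention weights of shape (num_heads, H*W, H*W).
    """
    h, w, c = feature_map.shape
    assert c % num_heads == 0, "channels must divide evenly across heads"
    d = c // num_heads

    # Split channels into heads: (num_heads, N, d) with N = H*W locations.
    tokens = feature_map.reshape(h * w, num_heads, d).transpose(1, 0, 2)

    # Assumption: "normalized" is modeled here as cosine-similarity attention.
    unit = l2_normalize(tokens)
    sim = unit @ unit.transpose(0, 2, 1)          # (num_heads, N, N)
    attn = softmax(sim / temperature, axis=-1)    # each row sums to 1

    # Each location becomes a weighted average of similar locations.
    mixed = attn @ tokens                         # (num_heads, N, d)
    return mixed.transpose(1, 0, 2).reshape(h, w, c), attn

rng = np.random.default_rng(0)
fmap = rng.normal(size=(7, 7, 64))   # stand-in for a 7x7 feature map
ctx, attn = contextualize(fmap)
```

Here each of the 49 spatial locations is replaced by an average of all locations, weighted by per-head similarity. In the paper the layer is trained end to end as part of the self-supervised objective, which this sketch leaves out.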
