CV | May 30, 2022

Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks

arXiv:2205.15173v2 | 5 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the need for better self-supervised pre-training methods for dense prediction tasks in computer vision. Its approach is an incremental advance over existing contrastive methods.

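The paper tackles self-supervised pre-training of Vision Transformers for dense prediction tasks. It introduces a contrastive loss across views that compares pixel-level representations to global image representations, which yields better local features than contrastive pre-training on global image representations alone. Because the number of negative examples the loss needs is on the order of the number of local features, the approach does not suffer from a reduced batch size. The authors demonstrate its effectiveness on semantic segmentation and monocular depth estimation.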

We present a new self-supervised pre-training of Vision Transformers for dense prediction tasks. It is based on a contrastive loss across views that compares pixel-level representations to global image representations. This strategy produces better local features suitable for dense prediction tasks as opposed to contrastive pre-training based on global image representation only. Furthermore, our approach does not suffer from a reduced batch size since the number of negative examples needed in the contrastive loss is in the order of the number of local features. We demonstrate the effectiveness of our pre-training strategy on two dense prediction tasks: semantic segmentation and monocular depth estimation.
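
The abstract does not spell out the exact form of the loss, so the following is only a minimal sketch of one way a local-to-global contrastive objective could look, not the authors' implementation. It assumes an InfoNCE-style loss in which the global representation of one view is the query and the patch-level features of the other view are the keys. Patches from the same image are treated as positives and patches from the other images in the batch as negatives, so the number of negatives scales with the number of local features. The function name `local_global_info_nce`, the `temperature` value, and the averaging over multiple positives are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def local_global_info_nce(local_a, global_b, temperature=0.1):
    """Hypothetical local-to-global contrastive loss (illustrative only).

    local_a:  (B, N, D) patch-level features from view A.
    global_b: (B, D) global image features from view B.
    """
    B, N, D = local_a.shape
    keys = l2_normalize(local_a).reshape(B * N, D)   # all local features in the batch
    queries = l2_normalize(global_b)                 # one global query per image

    # Cosine similarity between every global query and every local key.
    logits = queries @ keys.T / temperature          # shape (B, B*N)

    # Numerically stable log-softmax over all B*N local features.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Key j comes from image j // N; it is a positive for that image's query.
    owner = np.arange(B * N) // N
    pos_mask = owner[None, :] == np.arange(B)[:, None]

    # Average negative log-likelihood over all positive (global, local) pairs.
    return -log_prob[pos_mask].mean()


# Toy usage: each image has its own prototype direction (assumed synthetic data).
rng = np.random.default_rng(0)
B, N, D = 4, 6, 8
prototypes = np.eye(B, D)
local_a = prototypes[:, None, :] + 0.05 * rng.standard_normal((B, N, D))
global_b = prototypes

loss_aligned = local_global_info_nce(local_a, global_b)
# Pair each global feature with the wrong image's patches for comparison.
loss_mismatched = local_global_info_nce(local_a, np.roll(global_b, 1, axis=0))
```

In this toy setup each query sees (B - 1) * N negatives, which illustrates the abstract's point that the pool of negatives grows with the number of local features rather than with the batch size alone. The loss should come out lower when the global and local features come from the same image than when they are mismatched.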

