CV | May 30, 2022

Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks

Jaonary Rabarisoa, Valentin Belissen, Florian Chabot, Quoc-Cuong Pham

arXiv:2205.15173v2 | 2.65 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the need for better self-supervised pre-training methods for dense prediction tasks in computer vision. Its approach is an incremental advance over existing contrastive methods.

The paper tackles self-supervised pre-training of Vision Transformers for dense prediction tasks. It introduces a cross-view contrastive loss that compares pixel-level representations to global image representations, which yields better local features than contrastive pre-training on global representations alone and does not require a reduced batch size. The authors demonstrate the approach on semantic segmentation and monocular depth estimation.

We present a new self-supervised pre-training of Vision Transformers for dense prediction tasks. It is based on a contrastive loss across views that compares pixel-level representations to global image representations. This strategy produces better local features suitable for dense prediction tasks as opposed to contrastive pre-training based on global image representation only. Furthermore, our approach does not suffer from a reduced batch size since the number of negative examples needed in the contrastive loss is on the order of the number of local features. We demonstrate the effectiveness of our pre-training strategy on two dense prediction tasks: semantic segmentation and monocular depth estimation.
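To make the idea of a local-to-global contrastive loss concrete, here is a minimal sketch in plain Python. It is not the authors' formulation, which the abstract does not spell out. It assumes an InfoNCE-style loss in which the global embedding of one view is the query, the local (pixel or patch) embeddings of the other view of the same image are positives, and the local embeddings of the other images in the batch are negatives. The function name `local_global_info_nce`, the cosine similarity, and the temperature value are all illustrative assumptions.

```python
import math


def l2_normalize(v, eps=1e-12):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = max(math.sqrt(sum(x * x for x in v)), eps)
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_global_info_nce(global_feats, local_feats, temperature=0.1):
    """Hypothetical local-to-global InfoNCE loss (illustrative only).

    global_feats[i]: global embedding of image i computed from one view.
    local_feats[i]:  list of local embeddings of image i from the other view.
    Positives for image i are its own local embeddings; negatives are the
    local embeddings of every other image in the batch (an assumption).
    """
    g = [l2_normalize(v) for v in global_feats]
    loc = [[l2_normalize(v) for v in img] for img in local_feats]

    total, count = 0.0, 0
    for i, g_i in enumerate(g):
        # Similarity of this global query to every local key in the batch.
        logits = [[dot(g_i, key) / temperature for key in img] for img in loc]
        flat = [x for img in logits for x in img]
        # Numerically stable log-sum-exp over all keys.
        m = max(flat)
        log_denom = m + math.log(sum(math.exp(x - m) for x in flat))
        # One cross-entropy term per positive local key.
        for pos_logit in logits[i]:
            total += log_denom - pos_logit
            count += 1
    return total / count


# Toy batch: two images, two local embeddings each, 2-D features.
global_feats = [[1.0, 0.0], [0.0, 1.0]]
local_feats = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
aligned_loss = local_global_info_nce(global_feats, local_feats)
# Swapping the local features between images breaks the alignment.
swapped_loss = local_global_info_nce(global_feats, local_feats[::-1])
```

In this sketch the number of keys per query is the batch size times the number of local features per image, so the pool of negatives grows with the number of local features. That matches the abstract's remark about negatives in spirit, but the exact construction used in the paper may differ.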
