CV · LG · Jun 1, 2022

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

arXiv:2206.00481v2 · 3 citations · h-index: 30 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need of ViTs for large datasets to reach good accuracy, which benefits computer vision researchers and practitioners who work with limited data.

The paper tackles the problem that Vision Transformers (ViTs) underperform on small datasets because they lack inductive bias. It proposes RelViT, a self-supervised learning (SSL) strategy that uses relations between image patches as training tasks. The authors report that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets.

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple yet effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations between image patches that the model has to solve before or jointly with the supervised task. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signals at each training step. We evaluated our methods on several image benchmarks, finding that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.
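
The abstract does not spell out the patch-relation tasks, so the following is only a minimal sketch of one plausible task of this kind: classifying the spatial relation between every pair of patches in the image grid. The label scheme (same patch, eight neighbour directions, not adjacent) and the names `patch_coords`, `relation_label`, and `relation_targets` are assumptions made for illustration and are not taken from the paper or its repository. The sketch only builds the targets; it shows how a single unlabeled image can yield one target per patch pair, which is one way to read "exploiting more training signals at each training step".

```python
# Hypothetical label scheme (an assumption, not the paper's definition):
# 0 = same patch, 1..8 = the eight neighbour directions, 9 = not adjacent.
DIRECTIONS = {
    (0, 0): 0,
    (-1, -1): 1, (-1, 0): 2, (-1, 1): 3,
    (0, -1): 4, (0, 1): 5,
    (1, -1): 6, (1, 0): 7, (1, 1): 8,
}
NOT_ADJACENT = 9


def patch_coords(grid_size):
    """Return row-major (row, col) coordinates of a grid_size x grid_size patch grid."""
    return [(r, c) for r in range(grid_size) for c in range(grid_size)]


def relation_label(a, b):
    """Label the position of patch b relative to patch a."""
    offset = (b[0] - a[0], b[1] - a[1])
    return DIRECTIONS.get(offset, NOT_ADJACENT)


def relation_targets(grid_size):
    """Build the N x N matrix of pairwise relation labels, with N = grid_size ** 2."""
    coords = patch_coords(grid_size)
    return [[relation_label(a, b) for b in coords] for a in coords]


# A 3 x 3 grid has 9 patches, hence 9 * 9 = 81 pairwise targets per image,
# all derived from patch positions alone, with no external annotation.
targets = relation_targets(3)
num_targets = sum(len(row) for row in targets)
```

In a training loop, a relation head would read pairs of encoder output tokens and be trained against `targets` with a classification loss, either before or alongside the supervised objective; that head is not shown here.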

Code Implementations: 2 repos
