CV | May 30, 2022

GMML is All you Need

Sara Atito, Muhammad Awais, Josef Kittler

arXiv:2205.14986v111.221 citations | h-index: 95 | Has Code

Originality: Incremental advance

AI Analysis

GMML addresses the data-hungry nature of vision transformers, offering computer vision researchers a simpler alternative to existing self-supervised methods.

The paper tackles the problem of self-supervised pretraining for vision transformers by proposing GMML, which extracts contextual information from images without needing labels, achieving competitive performance on benchmarks like ImageNet.

Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data hungry. This has motivated research in self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels to link it to the image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle for the learning process in the majority of self-supervised methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notions of image similarity and data integrity. However, this approach lacks the natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by randomly manipulating groups of connected tokens, so that each group covers a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most existing SSL approaches, GMML does not require a momentum encoder, nor does it rely on careful implementation details such as large batches and gradient stopping, which are artefacts of most current self-supervised learning techniques. The source code is publicly available for the community to train on bigger corpora: https://github.com/Sara-Ahmed/GMML.
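The idea of manipulating groups of connected tokens and recovering them from the visible context can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`group_mask`, `corrupt`, `masked_l1`), the rectangular block shape, the 50% masking ratio, the 16-pixel patch size, the use of uniform noise as the corruption, and the L1 loss are all assumptions chosen for illustration. No transformer is included; the sketch only shows how a block-wise mask over the patch grid might be built, applied, and used to restrict a reconstruction loss to the hidden regions.

```python
import numpy as np


def group_mask(grid_h, grid_w, target_ratio=0.5, max_block=4, rng=None):
    """Build a boolean mask over the patch grid from random rectangular blocks.

    Blocks of connected patches are added until at least `target_ratio`
    of the grid is masked (hypothetical stopping rule).
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    while mask.mean() < target_ratio:
        # Sample a block size, clamped to the grid dimensions.
        bh = int(rng.integers(1, min(max_block, grid_h) + 1))
        bw = int(rng.integers(1, min(max_block, grid_w) + 1))
        # Sample a top-left corner so the block fits inside the grid.
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask


def corrupt(image, mask, patch=16, rng=None):
    """Replace masked patches of an (H, W, C) image with uniform noise.

    Returns the corrupted image and the pixel-level boolean mask.
    Noise is an assumed corruption; other manipulations are possible.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)
    corrupted = image.copy()
    noise = rng.random(image.shape)
    corrupted[pixel_mask] = noise[pixel_mask]
    return corrupted, pixel_mask


def masked_l1(reconstruction, target, pixel_mask):
    """Mean absolute error computed over the masked pixels only."""
    return float(np.abs(reconstruction - target)[pixel_mask].mean())
```

In a pretraining loop under these assumptions, a vision transformer would receive the corrupted image, output a reconstruction of the same shape, and be trained to minimise `masked_l1` against the clean image. Because the loss needs only a single corrupted view, a setup like this has no use for a second encoder or a contrastive batch.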

