CV | Jun 10, 2022

SERE: Exploring Feature Self-relation for Self-supervised Transformer

arXiv:2206.05184v3 | 22 citations | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for self-supervised strategies tailored to ViT properties, offering a domain-specific improvement for vision tasks.

The paper tackles self-supervised learning for vision transformers (ViT) by proposing SERE, a method that learns from feature self-relations (spatial and channel) rather than relying solely on instance-level discrimination of feature embeddings. The authors report stronger representations that stably improve performance on multiple downstream tasks.

Learning representations with self-supervision for convolutional networks (CNN) has been validated to be effective for vision tasks. As an alternative to CNN, vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViT. Still, most works follow self-supervised strategies designed for CNN, e.g., instance-level discrimination of samples, but they ignore the properties of ViT. We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks. To enforce this property, we explore the feature SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations, i.e., spatial/channel self-relations, for self-supervised learning. Self-relation based learning further enhances the relation modeling ability of ViT, resulting in stronger representations that stably improve performance on multiple downstream tasks. Our source code is publicly available at: https://github.com/MCG-NKU/SERE.
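The idea of spatial and channel self-relations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that); it is one plausible reading of the abstract under stated assumptions: a self-relation is taken to be a softmax over cosine similarities, the two views are assumed to be already spatially aligned, and the function names, the temperature value, and the cross-entropy objective are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1, temperature=1.0):
    # Numerically stable softmax with a temperature (assumed hyperparameter).
    z = x / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def spatial_self_relation(tokens, temperature=0.5):
    # tokens: (N, C) patch features. Returns (N, N); each row is a
    # distribution describing how one patch relates to all patches.
    t = l2_normalize(tokens, axis=1)
    return softmax(t @ t.T, axis=-1, temperature=temperature)

def channel_self_relation(tokens, temperature=0.5):
    # Returns (C, C); each row is a distribution describing how one
    # channel relates to all channels across the patches.
    c = l2_normalize(tokens, axis=0)
    return softmax(c.T @ c, axis=-1, temperature=temperature)

def relation_cross_entropy(target, pred, eps=1e-8):
    # Row-wise cross-entropy between two relation matrices (assumed objective).
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())

# Toy stand-ins for ViT patch features of two augmented views:
# 16 patches, 32 channels, with the second view a perturbed copy of the first.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(16, 32))
view_b = view_a + 0.1 * rng.normal(size=(16, 32))

spatial_loss = relation_cross_entropy(
    spatial_self_relation(view_a), spatial_self_relation(view_b))
channel_loss = relation_cross_entropy(
    channel_self_relation(view_a), channel_self_relation(view_b))
total_loss = spatial_loss + channel_loss
```

In this sketch the supervision signal is the relation matrix rather than the embedding itself, which is the distinction the abstract draws. With real augmentations the two views cover different crops, so the spatial relations would first need to be restricted to the overlapping region; that step is omitted here.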

Code Implementations: 1 repo