CV, Oct 18, 2024

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

arXiv:2410.14874v2, 5 citations, h-index: 10
Originality: Synthesis-oriented
AI Analysis

This is an incremental improvement for vision tasks, potentially benefiting researchers and practitioners using Vision Transformers.

The paper enhances Vision Transformers by overlapping the heads in Multi-Head Self-Attention. It introduces Multi-Overlapped-Head Self-Attention (MOHSA), which the authors report yields a significant performance boost across five Transformer models on four benchmark datasets.

Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key component of this success is the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where the queries, keys, and values of each head are overlapped with those of its two adjacent heads, while zero-padding is used for the first and last heads, which have only one neighboring head. Several paradigms for the overlapping ratio are proposed to investigate which setting performs best. The proposed approach is evaluated using five Transformer models on four benchmark datasets and yields a significant performance boost. The source code will be made publicly available upon publication.
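
The head-overlapping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_overlapping_heads` and the `overlap` parameter (the number of channels borrowed from each neighboring head) are assumptions made for illustration, and the abstract does not specify how the overlap ratio is chosen per layer. The sketch splits one token's feature vector into heads, lets each head extend into its two neighbors, and zero-pads the outer side of the first and last heads.

```python
def split_overlapping_heads(vec, num_heads, overlap):
    """Split one token's feature vector into heads that overlap their neighbors.

    vec       -- list of floats whose length is divisible by num_heads
    num_heads -- number of attention heads
    overlap   -- channels borrowed from each adjacent head (hypothetical parameter)
    """
    dim = len(vec)
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    if not 0 <= overlap <= head_dim:
        raise ValueError("overlap must be between 0 and the per-head dimension")

    # Zero-pad both ends so the first and last heads, which have only one
    # neighbor, end up with the same width as the interior heads.
    padded = [0.0] * overlap + list(vec) + [0.0] * overlap

    heads = []
    for h in range(num_heads):
        # In padded coordinates, head h starts `overlap` channels before its
        # own slice and ends `overlap` channels after it.
        start = h * head_dim
        heads.append(padded[start:start + head_dim + 2 * overlap])
    return heads


# Example: 8 channels, 4 heads, 1 borrowed channel on each side.
example = split_overlapping_heads([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4, 1)
```

In a full attention layer the same split would be applied to the queries, keys, and values of every token. Each head then has `head_dim + 2 * overlap` channels, so the concatenated head outputs are wider than the input; how that wider output is projected back is an assumption left outside this sketch. With `overlap = 0` the function reduces to the standard non-overlapping head split.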

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
