LG | CV | May 4, 2025

Always Skip Attention

arXiv:2505.01996v3 | 12 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses a critical training instability problem for researchers and practitioners using Vision Transformers, though the advance is incremental because it builds on existing skip connection mechanisms.

The paper identifies that self-attention in Vision Transformers catastrophically fails to train without skip connections, unlike other components or previous architectures like CNNs, and proposes Token Graying as a complementary method to improve input token conditioning, validated in supervised and self-supervised training.

We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying, a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.
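The conditioning claim in the abstract can be probed numerically. The following is a minimal sketch, not the paper's analysis or code: it builds a single-head self-attention layer in NumPy with random weights and compares the condition number of the output token matrix with and without a skip connection. The token count, embedding width, initialization scale (`init_std = 0.02`), and all function and variable names are assumptions chosen for illustration. Token Graying is not implemented, because the abstract does not describe how it works.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention on tokens x of shape (n_tokens, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])  # scaled dot-product logits
    return softmax(scores, axis=-1) @ v


# Illustrative sizes and init scale (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n_tokens, dim, init_std = 64, 32, 0.02

x = rng.standard_normal((n_tokens, dim))  # random stand-in for input tokens
w_q, w_k, w_v = (init_std * rng.standard_normal((dim, dim)) for _ in range(3))

sa_out = self_attention(x, w_q, w_k, w_v)

# 2-norm condition number: largest singular value over the smallest.
cond_input = np.linalg.cond(x)
cond_no_skip = np.linalg.cond(sa_out)        # SA(X)
cond_with_skip = np.linalg.cond(x + sa_out)  # X + SA(X)

print(f"cond(X)         = {cond_input:.3g}")
print(f"cond(SA(X))     = {cond_no_skip:.3g}")
print(f"cond(X + SA(X)) = {cond_with_skip:.3g}")
```

In this toy setting the attention weights are nearly uniform, so each output token sits close to the mean of the value vectors and the output matrix is close to rank one. That is why the condition number without the skip connection should come out much larger than with it. Treat this as intuition at a random initialization only; it says nothing about training dynamics or trained ViTs.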

