LG | Sep 30, 2025

Cutting the Skip: Training Residual-Free Transformers

arXiv:2510.00345v1 | 5 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses a fundamental optimization challenge in training vision transformers and could enable more hierarchical representation learning. The advance is incremental because it changes the initialization rather than the architecture.

The paper tackles the problem of training transformers without skip connections, which are typically needed for stability. It shows that a principled initialization strategy enables stable training of skipless Vision Transformers, which learn richer hierarchical representations and outperform baselines with skip connections on dense prediction benchmarks.

Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines that incorporate skip connections on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.
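
The conditioning argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's analysis or its initialization method. It assumes purely linear 2x2 "layers" with small random weights (the names `cond2` and `end_to_end_conditioning`, the depth, and the weight scale are all hypothetical choices) and compares the condition number of the end-to-end Jacobian with a skip, a product of factors (I + W), against the skipless case, a product of factors W.

```python
import math
import random


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def cond2(m):
    """Spectral condition number s1 / s2 of a 2x2 matrix, in closed form."""
    (a, b), (c, d) = m
    s = a * a + b * b + c * c + d * d   # squared Frobenius norm = s1^2 + s2^2
    det = abs(a * d - b * c)            # |det| = s1 * s2
    if det == 0.0:
        return math.inf                 # singular matrix
    disc = math.sqrt(max(0.0, s * s - 4.0 * det * det))
    s1_sq = 0.5 * (s + disc)            # largest squared singular value
    return s1_sq / det                  # s1 / s2 = s1^2 / (s1 * s2)


def end_to_end_conditioning(depth=8, scale=0.1, seed=0):
    """Return (cond with skips, cond without skips) for a toy linear stack.

    Assumption: each layer is a linear map with weights drawn uniformly
    from [-scale, scale]; real transformer blocks are nonlinear.
    """
    rng = random.Random(seed)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    j_skip, j_plain = identity, identity
    for _ in range(depth):
        w = [[rng.uniform(-scale, scale) for _ in range(2)] for _ in range(2)]
        w_plus_i = [[w[i][j] + identity[i][j] for j in range(2)]
                    for i in range(2)]
        j_skip = matmul2(w_plus_i, j_skip)   # Jacobian of x + W x per layer
        j_plain = matmul2(w, j_plain)        # Jacobian of W x per layer
    return cond2(j_skip), cond2(j_plain)


if __name__ == "__main__":
    cond_skip, cond_plain = end_to_end_conditioning()
    print("with skips:   ", cond_skip)
    print("without skips:", cond_plain)
```

With the default settings, every weight matrix has spectral norm at most 0.2, so each factor (I + W) has singular values in [0.8, 1.2] and the skip-connected product has a condition number of at most 1.5^8. The skipless product has no such bound and typically becomes far worse conditioned as depth grows, which is the gap a suitable initialization would need to close.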
