CV, LG | Feb 13, 2024

Peeking Behind the Curtains of Residual Learning

Tunhou Zhang, Feng Yan, Hai Li, Yiran Chen

arXiv:2402.08645v1 | 2.0 | h-index: 9

Originality: Highly original

AI Analysis

This work addresses a foundational problem in deep learning by providing theoretical insights and a practical method to train deep plain neural nets, which could benefit researchers and practitioners in computer vision and beyond.

The paper examines why residual learning succeeds and why plain neural nets fail at depth. It uncovers the 'dissipating inputs' phenomenon and proposes the Plain Neural Net Hypothesis (PNNH), a paradigm for training deep plain nets without residual connections. PNNH-enabled models achieve on-par accuracy, up to 0.3% higher training throughput, and 2x better parameter efficiency compared to ResNets and vision Transformers.

The utilization of residual learning has become widespread in deep and scalable neural nets. However, the fundamental principles that contribute to the success of residual learning remain elusive, thus hindering effective training of plain nets with depth scalability. In this paper, we peek behind the curtains of residual learning by uncovering the "dissipating inputs" phenomenon that leads to convergence failure in plain neural nets: the input is gradually compromised through plain layers due to non-linearities, resulting in challenges of learning feature representations. We theoretically demonstrate how plain neural nets degenerate the input to random noise and emphasize the significance of a residual connection that maintains a better lower bound of surviving neurons as a solution. With our theoretical discoveries, we propose "The Plain Neural Net Hypothesis" (PNNH) that identifies the internal path across non-linear layers as the most critical part in residual learning, and establishes a paradigm to support the training of deep plain neural nets devoid of residual connections. We thoroughly evaluate PNNH-enabled CNN architectures and Transformers on popular vision benchmarks, showing on-par accuracy, up to 0.3% higher training throughput, and 2x better parameter efficiency compared to ResNets and vision Transformers.
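As a rough illustration of the contrast the abstract draws between plain layers and residual connections, the sketch below pushes a random input through a stack of untrained ReLU layers and reports the fraction of activations that end up exactly zero. It is not the paper's experiment or its definition of "surviving neurons": the function name `zero_fraction`, the random He-style weights, and the zero-fraction metric are all assumptions chosen for simplicity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def zero_fraction(depth, width, residual, seed=0):
    """Fraction of exactly-zero activations after `depth` random ReLU layers.

    Hypothetical helper for illustration only; weights are random, not trained.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)  # stand-in for an input feature vector
    for _ in range(depth):
        # He-style scaling (an assumption, not taken from the paper)
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        f = relu(W @ h)
        # Plain layer: h <- F(h).  Residual layer: h <- h + F(h).
        h = h + f if residual else f
    return float(np.mean(h == 0.0))

plain_zero = zero_fraction(depth=10, width=256, residual=False)
resid_zero = zero_fraction(depth=10, width=256, residual=True)
print(f"zeroed activations, plain: {plain_zero:.2f}, residual: {resid_zero:.2f}")
```

In this toy setting each plain ReLU layer zeroes out roughly half of its units, while the identity path in the residual variant keeps every coordinate carrying some signal forward. The toy says nothing about trained networks or about the PNNH architecture itself.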
