$ε$-rank and the Staircase Phenomenon: New Insights into Neural Network Training Dynamics
This work addresses a fundamental problem in deep learning theory, of interest to both researchers and practitioners, and offers incremental insights into training dynamics.
The paper tackles the challenge of understanding neural network training dynamics by introducing $ε$-rank, a metric for effective features, and reports a universal staircase phenomenon in which loss reduction correlates with increasing $ε$-rank. The authors propose a pre-training strategy to eliminate this phenomenon, reducing training time and improving accuracy across tasks.
Understanding the training dynamics of deep neural networks (DNNs), particularly how they evolve low-dimensional features from high-dimensional data, remains a central challenge in deep learning theory. In this work, we introduce the concept of $ε$-rank, a novel metric quantifying the effective features of the neuron functions in the terminal hidden layer. Through extensive experiments across diverse tasks, we observe a universal staircase phenomenon: during training with standard stochastic gradient descent methods, the decline of the loss function is accompanied by an increase in the $ε$-rank and exhibits a staircase pattern. Theoretically, we rigorously prove a negative correlation between the lower bound of the loss and the $ε$-rank, demonstrating that a high $ε$-rank is essential for significant loss reduction. Moreover, numerical evidence shows that, within the same deep neural network, the $ε$-rank of each hidden layer is higher than that of the preceding hidden layer. Based on these observations, to eliminate the staircase phenomenon, we propose a novel pre-training strategy on the initial hidden layer that elevates the $ε$-rank of the terminal hidden layer. Numerical experiments validate its effectiveness in reducing training time and improving accuracy across various tasks. The newly introduced $ε$-rank is therefore a computable quantity that serves as an effective intrinsic metric for deep neural networks, providing a novel perspective for understanding their training dynamics and offering a theoretical foundation for designing efficient training strategies in practical applications.
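The abstract describes the $ε$-rank as a computable quantity but does not state its formula. As a rough illustration only, the sketch below assumes one plausible reading: sample the terminal-hidden-layer neuron functions on a set of points, form their empirical Gram matrix, and count the eigenvalues larger than $ε$. The function name `epsilon_rank`, the normalization by the number of samples, the threshold value, and the toy tanh features are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def epsilon_rank(features: np.ndarray, eps: float = 1e-3) -> int:
    """Count eigenvalues of the empirical Gram matrix that exceed eps.

    features has shape (n_samples, n_neurons); column j holds the j-th
    neuron function evaluated at the sample points. This definition is
    an assumption for illustration, not the paper's stated formula.
    """
    n_samples = features.shape[0]
    # Monte Carlo estimate of the pairwise inner products of neuron functions.
    gram = features.T @ features / n_samples
    # The Gram matrix is symmetric, so eigvalsh returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(gram)
    return int(np.sum(eigvals > eps))


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(512, 1))

# Hypothetical layer A: 16 tanh neurons with almost identical weights,
# so the neuron functions are nearly linearly dependent.
w_close = 1.0 + 1e-4 * rng.standard_normal((1, 16))
nearly_dependent = np.tanh(x @ w_close)

# Hypothetical layer B: 16 tanh neurons with widely spread weights and biases.
w_spread = rng.uniform(-8.0, 8.0, size=(1, 16))
b_spread = rng.uniform(-4.0, 4.0, size=16)
spread = np.tanh(x @ w_spread + b_spread)

rank_low = epsilon_rank(nearly_dependent)
rank_high = epsilon_rank(spread)
print("nearly dependent neurons:", rank_low)
print("spread-out neurons:", rank_high)
```

Under this assumed definition, a layer whose neurons are near copies of each other yields a small count, while a layer with more varied neurons yields a larger one, which matches the intuition that a higher $ε$-rank means more effective features are available for reducing the loss.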