Gradient Flow Convergence Guarantee for General Neural Network Architectures
This work addresses a foundational theoretical problem in machine learning by offering a general convergence guarantee, though the contribution is incremental in that it builds on prior architecture-specific results.
The paper tackles the challenge of explaining why gradient-based optimization succeeds in deep learning. It provides a unified proof of linear convergence of gradient flow for any neural network with piecewise non-zero polynomial, ReLU, or sigmoid activations. The result consolidates existing results under weaker assumptions, and its predictions agree empirically with practical gradient descent.
A key challenge in modern deep learning theory is to explain the remarkable success of gradient-based optimization methods when training large-scale, complex deep neural networks. Although linear convergence of such methods has been proved for a handful of specific architectures, a unified theory remains elusive. This article presents a unified proof of linear convergence of continuous gradient descent, also called gradient flow, when training any neural network with piecewise non-zero polynomial, ReLU, or sigmoid activations. Our primary contribution is a single, general theorem that not only covers architectures for which this result was previously unknown but also consolidates existing results under weaker assumptions. While our focus is theoretical and our results are exact only in the limit of infinitesimal step size, we nevertheless find excellent empirical agreement between the predictions of our result and the behavior of gradient descent with practical step sizes.
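To make the central notion concrete: gradient flow is the limit of gradient descent as the step size goes to zero, and linear convergence means the loss is bounded by an exponentially decaying function of training time, so the logarithm of the loss falls at least linearly. The sketch below is only an illustration of that behavior, not the paper's experimental setup. It approximates gradient flow with small-step gradient descent on a two-layer sigmoid network. The network width, the 1/sqrt(m) output scaling, the random data, the squared loss, and all names (`train`, `lr`, `steps`) are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(steps=2000, lr=0.01, n=5, d=10, m=64, seed=0):
    """Small-step gradient descent (an Euler discretization of gradient flow)
    on f(x) = a . sigmoid(W x) / sqrt(m) with squared loss.
    All sizes and the scaling are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.uniform(-1.0, 1.0, size=n)             # arbitrary targets
    W = rng.normal(size=(m, d))                    # hidden-layer weights
    a = rng.normal(size=m)                         # output weights
    losses = np.empty(steps + 1)
    for k in range(steps + 1):
        H = sigmoid(X @ W.T)                       # hidden activations, (n, m)
        r = H @ a / np.sqrt(m) - y                 # residuals, (n,)
        losses[k] = 0.5 * np.sum(r ** 2)
        if k == steps:
            break
        # Gradients of the squared loss with respect to a and W.
        grad_a = H.T @ r / np.sqrt(m)
        grad_W = ((r[:, None] * H * (1.0 - H)) * a).T @ X / np.sqrt(m)
        a -= lr * grad_a
        W -= lr * grad_W
    return losses

lr, steps = 0.01, 2000
losses = train(steps=steps, lr=lr)
# Average exponential decay rate over the run, in units of flow time t = lr * k.
rate = np.log(losses[0] / losses[-1]) / (lr * steps)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}, avg rate {rate:.4f}")
```

Plotting `np.log(losses)` against `lr * np.arange(steps + 1)` gives a visual check: under linear convergence the curve should stay below a straight line with negative slope. Shrinking `lr` while holding `lr * steps` fixed moves the discrete iterates closer to the continuous-time flow.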