Convergence of gradient descent for learning linear neural networks
This work provides theoretical guarantees for optimization in deep learning, but the contribution is incremental: it extends a prior gradient flow analysis to discrete gradient descent.
The paper analyzes the convergence of gradient descent for training deep linear neural networks. It shows that gradient descent converges to a critical point under suitable step sizes, that it reaches a global minimum for two-layer networks from almost all initializations, and that for three or more layers it converges to a global minimum on a manifold of fixed-rank matrices.
We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes, gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank, where the rank cannot be determined a priori.
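To make the setting concrete, here is a minimal NumPy sketch of gradient descent on the square loss of a deep linear network, i.e. on 0.5 * ||W_N ... W_1 X - Y||_F^2 over the factors W_1, ..., W_N. The function names `loss_and_grads` and `gradient_descent`, the constant step size, the Gaussian initialization, and its scale of 0.5 are illustrative assumptions and are not taken from the paper. In particular, the paper's step-size conditions are not reproduced here.

```python
import numpy as np


def loss_and_grads(Ws, X, Y):
    """Square loss 0.5 * ||W_N ... W_1 X - Y||_F^2 and its gradient per factor.

    Ws[0] is the first layer W_1, Ws[-1] is the last layer W_N.
    """
    n = len(Ws)
    # before[j] = Ws[j-1] @ ... @ Ws[0], with before[0] the identity.
    before = [np.eye(Ws[0].shape[1])]
    for W in Ws:
        before.append(W @ before[-1])
    # after[j] = Ws[n-1] @ ... @ Ws[j+1], with after[n-1] the identity.
    after = [np.eye(Ws[-1].shape[0])]
    for W in reversed(Ws[1:]):
        after.append(after[-1] @ W)
    after.reverse()

    residual = before[n] @ X - Y
    loss = 0.5 * np.sum(residual ** 2)
    # Chain rule: d loss / d W_j = after_j^T (residual X^T) before_j^T.
    grads = [after[j].T @ residual @ X.T @ before[j].T for j in range(n)]
    return loss, grads


def gradient_descent(X, Y, dims, step=0.01, iters=2000, seed=0):
    """Plain gradient descent with a constant step size (an assumption).

    dims = [d_0, d_1, ..., d_N] lists the layer widths, so W_j has shape
    (d_j, d_{j-1}). The random initialization is illustrative only.
    """
    rng = np.random.default_rng(seed)
    Ws = [0.5 * rng.standard_normal((dims[j + 1], dims[j]))
          for j in range(len(dims) - 1)]
    losses = []
    for _ in range(iters):
        loss, grads = loss_and_grads(Ws, X, Y)
        losses.append(loss)
        Ws = [W - step * G for W, G in zip(Ws, grads)]
    return Ws, losses
```

For example, `gradient_descent(np.eye(3), np.diag([2.0, 1.0, 0.5]), [3, 3, 3])` runs the two-layer case on a small full-rank target. Passing a longer `dims` list such as `[3, 3, 3, 3]` gives the setting with three or more layers, where the abstract's guarantee is a global minimum restricted to matrices of some fixed rank.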