LGOCMLFeb 19, 2019

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

arXiv:1902.07111v268 citations
AI Analysis

This provides theoretical guarantees for adaptive gradient methods in non-convex neural network optimization, which could benefit practitioners in deep learning.

The authors tackled the problem of proving global convergence for adaptive gradient methods in neural networks, showing that for sufficiently wide two-layer over-parameterized networks, their proposed method converges to the global minimum in polynomial time without requiring hyperparameter fine-tuning.

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee convergence to a stationary point. We propose an adaptive gradient method and show that for two-layer over-parameterized neural networks -- if the width is sufficiently large (polynomially) -- then the proposed method converges \emph{to the global minimum} in polynomial time, and convergence is robust, \emph{ without the need to fine-tune hyper-parameters such as the step-size schedule and with the level of over-parametrization independent of the training error}. Our analysis indicates in particular that over-parametrization is crucial for the harnessing the full potential of adaptive gradient methods in the setting of neural networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes