LG NE ML · Mar 23, 2020

Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses

Charles G. Frye, James Simon, Neha S. Wadia, Andrew Ligeralde, Michael R. DeWeese, Kristofer E. Bouchard

arXiv:2003.10397v15.85 citations

Originality: Synthesis-oriented

AI Analysis

This work highlights a methodological flaw in how neural network optimization has been analyzed. The contribution is incremental but important for researchers in deep learning theory.

The authors identified that existing methods for finding critical points in deep neural network loss functions often converge to gradient-flat regions where the gradient norm is stationary, rather than true critical points, challenging past interpretations and impacting second-order optimization design.

Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by characterizing the local curvature near critical points of the loss function, where the gradients are near zero, and demonstrating that neural network losses enjoy a no-bad-local-minima property and an abundance of saddle points. We report here that the methods used to find these putative critical points suffer from a bad local minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, such that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care in both interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks.
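To make the notion of a gradient-flat point concrete, here is a minimal sketch on a toy two-parameter loss. The loss, the step size, and the function names are assumptions chosen for illustration and do not come from the paper. The sketch relies on the standard identity that the gradient of the squared gradient norm, 0.5 * ||g||^2, is the Hessian-vector product H g. Plain descent on the gradient norm can therefore stall wherever H g = 0, which happens either at a true critical point (g = 0) or where a nonzero g lies in the kernel of H.

```python
# Toy loss (an assumption for illustration, not taken from the paper):
#   f(x, y) = x**3 / 3 + x + y**2 / 2
# It has no critical points, because df/dx = x**2 + 1 >= 1 everywhere.

def grad(p):
    """Gradient of the toy loss at p = (x, y)."""
    x, y = p
    return (x * x + 1.0, y)

def hess_vec(p, v):
    """Hessian-vector product; the Hessian of the toy loss is diag(2x, 1)."""
    x, _ = p
    return (2.0 * x * v[0], v[1])

def minimize_grad_norm(p, lr=0.1, steps=500):
    """Gradient descent on 0.5 * ||grad f||^2, whose gradient is H @ g."""
    for _ in range(steps):
        g = grad(p)
        hg = hess_vec(p, g)
        p = (p[0] - lr * hg[0], p[1] - lr * hg[1])
    return p

p_final = minimize_grad_norm((0.5, 1.0))
g_final = grad(p_final)
hg_final = hess_vec(p_final, g_final)

print("point:   ", p_final)   # approaches (0, 0)
print("gradient:", g_final)   # approaches (1, 0), which is not zero
print("H @ g:   ", hg_final)  # approaches (0, 0)
```

In this toy case the search settles near the origin, where the gradient norm is about 1 rather than 0. The Hessian there is diag(0, 1), so the gradient (1, 0) lies in its kernel, and the loss along the gradient direction, t**3 / 3 + t, is approximately linear near t = 0. A point-finding method that only monitors H g, or the decrease of the gradient norm, would report convergence even though no critical point exists.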
