OC | CO | ML | Sep 14, 2017

The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems

arXiv:1709.04718v2, 9 citations
AI Analysis

The paper offers a rigorous explanation for a widely observed but poorly understood phenomenon in machine learning optimization, filling a gap in the theory.

The paper asks why stochastic gradient descent (SGD) prefers flat minimizers over sharp ones in nonconvex optimization. It proposes a deterministic mechanism that accurately explains this preference and verifies the mechanism's predictions on two nonconvex problems.

In several experimental reports on nonconvex optimization problems in machine learning, stochastic gradient descent (SGD) was observed to prefer minimizers with flat basins in comparison to more deterministic methods, yet there is very little rigorous understanding of this phenomenon. In fact, the lack of such work has led to an unverified, but widely-accepted stochastic mechanism describing why SGD prefers flatter minimizers to sharper minimizers. However, as we demonstrate, the stochastic mechanism fails to explain this phenomenon. Here, we propose an alternative deterministic mechanism that can accurately explain why SGD prefers flatter minimizers to sharper minimizers. We derive this mechanism based on a detailed analysis of a generic stochastic quadratic problem, which generalizes known results for classical gradient descent. Finally, we verify the predictions of our deterministic mechanism on two nonconvex problems.
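The abstract says the mechanism generalizes known results for classical gradient descent. The sketch below illustrates only the classical result, not the paper's stochastic analysis: on a one-dimensional quadratic f(x) = 0.5 * c * x^2, gradient descent with a fixed step size converges only when the step size is below 2 / c. At the same step size, a flat basin (small c) is therefore stable while a sharp one (large c) is escaped. The function name, curvatures, and step size are illustrative assumptions and are not taken from the paper.

```python
def gradient_descent_quadratic(curvature, step_size, x0=1.0, num_steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns |x| after num_steps iterations. Each step multiplies x by
    (1 - step_size * curvature), so the iterates shrink only when
    step_size < 2 / curvature.
    """
    x = x0
    for _ in range(num_steps):
        grad = curvature * x          # f'(x) for the quadratic
        x = x - step_size * grad      # plain (deterministic) gradient step
    return abs(x)


# Hypothetical values chosen for illustration only.
step_size = 0.5
flat = gradient_descent_quadratic(curvature=1.0, step_size=step_size)   # 0.5 < 2 / 1.0
sharp = gradient_descent_quadratic(curvature=5.0, step_size=step_size)  # 0.5 > 2 / 5.0

print(f"flat basin:  |x| after 50 steps = {flat:.3e}")
print(f"sharp basin: |x| after 50 steps = {sharp:.3e}")
```

With these values the flat-basin iterate contracts by a factor of 0.5 per step, while the sharp-basin iterate grows in magnitude by a factor of 1.5 per step. According to the abstract, the paper's analysis extends this kind of stability result to a generic stochastic quadratic problem. The sketch does not model batch size.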

