ML, LG, NE · May 26, 2016

No bad local minima: Data independent training error guarantees for multilayer neural networks

arXiv:1605.08361v2 · 242 citations
Originality: Highly original
AI Analysis

The paper offers a theoretical explanation, under stated assumptions, for why local-update methods such as stochastic gradient descent train neural networks well in practice, a foundational question in machine learning.

The paper tackles non-convex optimization in multilayer neural networks with piecewise linear activations, quadratic loss, and a single output. It proves that, under mild over-parametrization, the training error is zero at every differentiable local minimum for almost every dataset and dropout-like noise realization. The guarantees are also checked numerically.

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
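The setting described in the abstract can be illustrated with a small numerical sketch. This is not the authors' experiment: the function name `train_one_hidden_layer`, the layer sizes, the leaky-ReLU slope, the learning rate, and the random Gaussian data are all assumptions chosen for illustration, and the dropout-like noise from the paper is omitted. The sketch trains a one-hidden-layer network with a piecewise linear activation, a single output, and quadratic loss using plain full-batch gradient descent, with more first-layer weights than training samples as a stand-in for "mild over-parametrization".

```python
import numpy as np


def leaky_relu(u, slope):
    # Piecewise linear activation: identity for u > 0, scaled by `slope` otherwise.
    return np.where(u > 0, u, slope * u)


def train_one_hidden_layer(n_samples=20, d_in=10, d_hidden=10,
                           steps=2000, lr=0.01, slope=0.1, seed=0):
    """Full-batch gradient descent on the mean squared error.

    All sizes and hyperparameters are illustrative assumptions, not values
    taken from the paper. Returns the list of training losses per step.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary random dataset (the paper's result is stated for almost every dataset).
    X = rng.standard_normal((n_samples, d_in))
    y = rng.standard_normal(n_samples)
    # First-layer weights W (d_hidden x d_in) and output weights v (d_hidden,).
    # Here d_in * d_hidden = 100 exceeds n_samples = 20.
    W = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
    v = rng.standard_normal(d_hidden) / np.sqrt(d_hidden)

    losses = []
    for _ in range(steps):
        U = X @ W.T                      # pre-activations, shape (n_samples, d_hidden)
        A = leaky_relu(U, slope)         # hidden activations
        err = A @ v - y                  # residuals of the single output
        losses.append(float(np.mean(err ** 2)))

        # Gradients of the mean squared error with respect to v and W.
        grad_v = 2.0 * A.T @ err / n_samples
        dA = np.where(U > 0, 1.0, slope)             # activation derivative
        G = err[:, None] * dA * v[None, :]           # shape (n_samples, d_hidden)
        grad_W = 2.0 * G.T @ X / n_samples

        v -= lr * grad_v
        W -= lr * grad_W
    return losses


if __name__ == "__main__":
    losses = train_one_hidden_layer()
    print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

Running the script prints the training loss before and after optimization. A falling loss is consistent with the abstract's claim but does not verify it: gradient descent for a fixed number of steps need not stop at a differentiable local minimum, and this sketch does not check the paper's exact over-parametrization condition.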

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
