LG, OC, ML | Nov 4, 2019

Sub-Optimal Local Minima Exist for Neural Networks with Almost All Non-Linear Activations

arXiv:1911.01413v3 | 15 citations
Originality: Highly original
AI Analysis

This is a foundational result that challenges the theoretical understanding of why over-parameterization helps in training neural networks, affecting all of ML/AI.

The paper proves that sub-optimal local minima exist for multi-layer neural networks with generic input data and almost all non-linear activation functions, regardless of network width, contradicting prior claims for 1-hidden-layer networks.

Does over-parameterization eliminate sub-optimal local minima for neural networks? An affirmative answer was given by a classical result in [59] for 1-hidden-layer wide neural networks. A few recent works have extended the setting to multi-layer neural networks, but none of them has proved that every local minimum is global. Why has this result never been extended to deep networks? In this paper, we show that the task is impossible because the original result for 1-hidden-layer networks in [59] cannot hold. More specifically, we prove that for any multi-layer network with generic input data and non-linear activation functions, sub-optimal local minima can exist, no matter how wide the network is (as long as the last hidden layer has at least two neurons). While the result of [59] assumes sigmoid activation, our counter-example covers a large set of activation functions (dense in the set of continuous functions), indicating that the limitation is not due to the specific activation. Our result indicates that "no bad local-min" may be unable to explain the benefit of over-parameterization for training neural nets.
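
For readers who want the terms pinned down, the standard definitions behind "sub-optimal local minimum" can be sketched as below. The notation (a loss L over a parameter vector θ, a radius ε) is an assumption chosen for illustration and is not taken from the paper.

```latex
% Assumed notation: L is the training loss, \theta the vector of all network weights.
\theta^\ast \text{ is a local minimum of } L
  \iff \exists\, \varepsilon > 0 \ \text{such that}\ 
  L(\theta^\ast) \le L(\theta) \quad \forall\, \theta \ \text{with}\ \|\theta - \theta^\ast\| < \varepsilon .

% A local minimum is sub-optimal when its loss is strictly above the best achievable value.
\theta^\ast \text{ is sub-optimal}
  \iff L(\theta^\ast) > \inf_{\theta} L(\theta) .
```

In these terms, the "no bad local-min" property says that no point satisfies both conditions at once, and the abstract's claim is that such points can exist at any width.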

