OC · LG · ML · Feb 18, 2018

Spurious Valleys in Two-layer Neural Network Optimization Landscapes

arXiv:1802.06384v4 · 95 citations
AI Analysis

The paper addresses the fundamental challenge of understanding non-convex optimization in neural networks, offering researchers and practitioners theoretical insight into why gradient descent often succeeds. Its contribution is incremental, building on prior work that characterizes loss landscapes.

The paper investigates spurious valleys in the loss landscapes of two-layer neural networks. It shows that finite intrinsic dimension rules out spurious valleys in sufficiently overparametrized models, regardless of the data distribution, while infinite intrinsic dimension permits them for certain data distributions, regardless of overparametrization. Where spurious valleys do exist, they are confined to low risk levels and are avoided with high probability in overparametrized models.

Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent spur in research attempting to characterize properties of their loss surface that may explain such success. In this paper, we address this phenomenon by studying a key topological property of the loss: the presence or absence of spurious valleys, defined as connected components of sub-level sets that do not include a global minimum. Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, we identify a notion of intrinsic dimension and show that it provides necessary and sufficient conditions for the absence of spurious valleys. More concretely, finite intrinsic dimension guarantees that for sufficiently overparametrised models no spurious valleys exist, independently of the data distribution. Conversely, infinite intrinsic dimension implies that spurious valleys do exist for certain data distributions, independently of model overparametrisation. Besides these positive and negative results, we show that, although spurious valleys may exist in general, they are confined to low risk levels and avoided with high probability on overparametrised models.
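The definition of a spurious valley given in the abstract can be written out as a short formal sketch. The notation below (the parameter vector \theta, the loss L, the sub-level set \Omega_L(c), and the network form U\rho(Wx)) is assumed here for illustration and may differ from the paper's own symbols. The sketch also assumes that a global minimum of L is attained.

```latex
% Assumed notation: a two-layer network with hidden width p and
% parameters \theta = (U, W), activation \rho applied entrywise.
\Phi(x;\theta) = U\,\rho(Wx), \qquad W \in \mathbb{R}^{p \times n},\; U \in \mathbb{R}^{m \times p}.

% Sub-level set of the loss L at level c.
\Omega_L(c) = \{\, \theta : L(\theta) \le c \,\}.

% A spurious valley is a connected component V of some sub-level set
% that contains no global minimizer of L.
V \subseteq \Omega_L(c) \text{ is a connected component with }
V \cap \operatorname*{arg\,min}_{\theta} L(\theta) = \emptyset .
```

Read this way, the absence of spurious valleys means that every connected component of every sub-level set contains a global minimizer.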

