LG, Feb 25, 2021

Spurious Local Minima Are Common for Deep Neural Networks with Piecewise Linear Activations

arXiv:2102.13233v1, 12 citations
Originality: Incremental advance
AI Analysis

This theoretical result highlights an optimization challenge for deep learning practitioners: it shows that spurious local minima are common in widely used architectures, which may keep training from reaching a global minimum.

The paper proves that deep neural networks with piecewise linear activations, such as ReLU, commonly have spurious local minima when trained on datasets that cannot be fitted by linear models. The cause is that different pieces of the network's continuous piecewise linear (CPWL) output can fit disjoint groups of data samples, and different fits generally yield different levels of empirical risk.
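The mechanism can be illustrated with a toy sketch. This is not the paper's own example: the two-piece model `max(a1*x + b1, a2*x + b2)`, the three samples of `y = |x|`, the squared loss, and all names below are assumptions chosen for illustration. One parameter setting lets each linear piece fit its own group of samples and reaches zero risk. Another fits all samples with a single least-squares line while the second piece stays inactive, and small perturbations around it do not lower the risk.

```python
import random

# Hypothetical toy data sampled from y = |x|; no single line fits it exactly.
DATA = [(-1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]


def model(x, params):
    """Two-piece CPWL model: the maximum of two linear pieces."""
    a1, b1, a2, b2 = params
    return max(a1 * x + b1, a2 * x + b2)


def empirical_risk(params):
    """Mean squared error over the toy data set."""
    return sum((model(x, params) - y) ** 2 for x, y in DATA) / len(DATA)


# Each piece fits its own group of samples (-x on the left, x on the right).
GLOBAL_MIN = (-1.0, 0.0, 1.0, 0.0)

# Piece 1 is the least-squares line y = 2/3 through all three samples;
# piece 2 lies far below every sample, so it is never active on the data.
SPURIOUS = (0.0, 2.0 / 3.0, 0.0, -10.0)


def looks_like_local_min(params, radius=1e-3, trials=2000, seed=0):
    """Sampling-based sanity check (not a proof): no random perturbation
    within `radius` of `params` lowers the empirical risk."""
    rng = random.Random(seed)
    base = empirical_risk(params)
    for _ in range(trials):
        nearby = tuple(p + rng.uniform(-radius, radius) for p in params)
        if empirical_risk(nearby) < base - 1e-12:
            return False
    return True
```

In this sketch `empirical_risk(GLOBAL_MIN)` is zero while `empirical_risk(SPURIOUS)` is 2/9, and `looks_like_local_min(SPURIOUS)` holds because the inactive piece cannot influence the output under small perturbations and the active piece is already the best single line.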

This paper shows theoretically that spurious local minima are common for deep fully-connected networks and convolutional neural networks (CNNs) with piecewise linear activation functions and datasets that cannot be fitted by linear models. A motivating example explains why spurious local minima exist: each output neuron of a deep fully-connected network or CNN with piecewise linear activations produces a continuous piecewise linear (CPWL) output, and different pieces of the CPWL output can fit disjoint groups of data samples when minimizing the empirical risk. Fitting data samples with different CPWL functions usually results in different levels of empirical risk, leading to the prevalence of spurious local minima. This result is proved in general settings with any continuous loss function. The main proof technique is to represent a CPWL function as a maximization over minimizations of linear pieces. Deep ReLU networks are then constructed to produce these linear pieces and to implement the maximization and minimization operations.
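The max-min representation can be sketched in a few lines. The sketch below is not the paper's construction: it uses the standard identities `max(a, b) = a + relu(b - a)` and `min(a, b) = a - relu(a - b)`, and the function names and the `TENT` example are hypothetical. It only shows that pairwise maximum and minimum, and therefore a maximum over minima of linear pieces, can be computed with ReLU units.

```python
def relu(z):
    return max(z, 0.0)


def relu_max(a, b):
    # max(a, b) = a + relu(b - a)
    return a + relu(b - a)


def relu_min(a, b):
    # min(a, b) = a - relu(a - b)
    return a - relu(a - b)


def cpwl_max_min(x, groups):
    """Evaluate max over groups of (min over linear pieces in the group).

    `groups` is a list of groups; each group is a list of
    (slope, intercept) pairs describing one-dimensional linear pieces.
    """
    outer = None
    for group in groups:
        inner = None
        for slope, intercept in group:
            piece = slope * x + intercept
            inner = piece if inner is None else relu_min(inner, piece)
        outer = inner if outer is None else relu_max(outer, inner)
    return outer


# Hypothetical example: a tent function, f(x) = max(min(x, 2 - x), 0).
TENT = [[(1.0, 0.0), (-1.0, 2.0)], [(0.0, 0.0)]]
```

For instance, `cpwl_max_min(1.0, TENT)` evaluates the peak of the tent, and inputs outside the interval from 0 to 2 map to zero because the constant piece in the second group takes over.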

