ML, LG | Nov 4, 2020

Which Minimizer Does My Neural Network Converge To?

arXiv:2011.02408v2 | 6 citations
AI Analysis

This work helps researchers and practitioners understand why the outcome of neural network training varies with the training procedure. Its contribution is incremental, since it builds on known issues in overparameterization.

The paper investigates how different training procedures affect which global minimizer an overparameterized neural network converges to, showing that initialization size can harm test performance and adaptive optimizers yield different minimizers than gradient descent, with effects persisting in less overparameterized networks.

The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.
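The two central claims of the abstract can be illustrated on a toy problem. The sketch below is not the paper's experiment and not its proposed strategy: it assumes an overparameterized linear least-squares model (more parameters than samples) as a stand-in for a strongly overparameterized NN, and all names, sizes, learning rates, and step counts are illustrative choices. In this simplified setting, GD started at zero reaches the minimum-norm interpolating solution, a larger initialization moves the GD solution away from it, and AdaGrad reaches a different zero-loss solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20  # fewer samples than parameters, so many zero-loss solutions exist
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Reference point: the minimum-norm solution of X w = y.
w_min = np.linalg.pinv(X) @ y


def gradient_descent(w0, steps=20000):
    """Plain GD on 0.5 * ||X w - y||^2 with a step size that is safe for this X."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X^T X
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


def adagrad(w0, lr=0.01, steps=20000, eps=1e-8):
    """AdaGrad: per-coordinate steps scaled by accumulated squared gradients."""
    w = w0.copy()
    g2 = np.zeros_like(w)
    for _ in range(steps):
        g = X.T @ (X @ w - y)
        g2 += g * g
        w -= lr * g / (np.sqrt(g2) + eps)
    return w


zero = np.zeros(d)
direction = rng.standard_normal(d)  # arbitrary initialization direction

w_gd = gradient_descent(zero)
w_gd_small = gradient_descent(0.1 * direction)
w_gd_big = gradient_descent(10.0 * direction)
w_ada = adagrad(zero)

for name, w in [("GD, zero init", w_gd), ("GD, small init", w_gd_small),
                ("GD, large init", w_gd_big), ("AdaGrad, zero init", w_ada)]:
    print(f"{name:20s} train residual = {np.linalg.norm(X @ w - y):.2e}, "
          f"distance to min-norm = {np.linalg.norm(w - w_min):.2e}")
```

All four runs fit the training data, but they differ in how far they end up from the minimum-norm solution. In this linear toy, GD never changes the part of the initialization that lies outside the row space of X, so that part grows with the size of the initialization. AdaGrad's per-coordinate scaling moves the iterate outside the row space even when it starts at zero.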

