Theory of Deep Learning IIb: Optimization Properties of SGD
This paper addresses the fundamental problem of understanding why SGD works well for deep learning optimization, offering insights for researchers in machine learning theory.
The paper investigates the optimization behavior of Stochastic Gradient Descent (SGD) in deep convolutional networks, providing theoretical and experimental evidence that SGD tends to concentrate on large-volume, flat minima, which are likely global minimizers.
In Theory IIb we characterize, with a mix of theory and experiments, the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers.
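The Langevin analogy in the conjecture can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's experiment: the one-dimensional two-well potential, the curvatures `a_flat` and `a_sharp`, and the `temperature` value are all hypothetical choices made for illustration. It relies on the standard fact that overdamped Langevin dynamics in a potential U at temperature T has a stationary density proportional to exp(-U/T), so two minima of equal depth receive probability mass according to their width.

```python
import numpy as np


def basin_masses(a_flat=0.5, a_sharp=8.0, temperature=0.1):
    """Stationary Boltzmann mass of a flat and a sharp well of equal depth.

    Hypothetical toy potential: U(x) = min(a_flat*(x+2)^2, a_sharp*(x-2)^2).
    Both wells have minimum value 0; only their curvature differs.
    """
    x = np.linspace(-10.0, 10.0, 200001)
    dx = x[1] - x[0]

    u_flat = a_flat * (x + 2.0) ** 2     # wide well centered at x = -2
    u_sharp = a_sharp * (x - 2.0) ** 2   # narrow well centered at x = +2
    u = np.minimum(u_flat, u_sharp)

    # Unnormalized stationary density of overdamped Langevin dynamics.
    density = np.exp(-u / temperature)

    # Assign each grid point to the well whose quadratic is lower there.
    in_flat = u_flat <= u_sharp
    mass_flat = density[in_flat].sum() * dx
    mass_sharp = density[~in_flat].sum() * dx

    total = mass_flat + mass_sharp
    return mass_flat / total, mass_sharp / total


if __name__ == "__main__":
    flat, sharp = basin_masses()
    print(f"mass in flat well:  {flat:.3f}")
    print(f"mass in sharp well: {sharp:.3f}")
```

At low temperature each well's mass scales as one over the square root of its curvature, so with these assumed values the flat well holds about four times the mass of the sharp one even though both minima are equally deep. In higher dimensions the same ratio applies along every direction, which is the sense in which "large volume" minima dominate.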