LG, Jan 7, 2018

Theory of Deep Learning IIb: Optimization Properties of SGD

arXiv:1801.02254v1, 75 citations
Originality: Incremental advance
AI Analysis

This paper addresses the fundamental problem of understanding why SGD works well for optimizing deep networks, offering insights for researchers in machine learning theory.

The paper investigates the optimization behavior of Stochastic Gradient Descent (SGD) in deep convolutional networks, providing theoretical and experimental evidence that SGD tends to concentrate on large-volume, flat minima, which are likely global minimizers.

In Theory IIb we characterize, with a mix of theory and experiments, the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers.
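The Langevin analogy in the conjecture can be illustrated with a toy simulation. The sketch below is not the paper's experiment. It assumes a hypothetical one-dimensional loss with two equally deep quadratic basins (curvature 1 and curvature 25), isotropic Gaussian noise in place of real SGD minibatch noise, and an Euler-Maruyama discretization of the Langevin equation. All names and parameter values (`A_FLAT`, `A_SHARP`, `temperature`, `dt`) are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical 1-D loss with two basins of equal depth (minimum value 0):
#   w <  BOUNDARY: 0.5 * A_FLAT  * (w + 1)^2   (flat basin, minimum at w = -1)
#   w >= BOUNDARY: 0.5 * A_SHARP * (w - 1)^2   (sharp basin, minimum at w = +1)
A_FLAT, A_SHARP = 1.0, 25.0
BOUNDARY = 2.0 / 3.0  # point between the minima where the two parabolas meet


def grad_loss(w):
    """Gradient of the piecewise-quadratic loss defined above."""
    return np.where(w < BOUNDARY, A_FLAT * (w + 1.0), A_SHARP * (w - 1.0))


def langevin_flat_fraction(n_chains=2000, n_steps=4000, dt=0.005,
                           temperature=0.5, seed=0):
    """Run independent Langevin chains and return the fraction that
    ends in the flat basin."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_chains)  # every chain starts at the sharp minimum
    noise_scale = np.sqrt(2.0 * temperature * dt)
    for _ in range(n_steps):
        # Euler-Maruyama step: gradient drift plus isotropic Gaussian noise
        w = w - grad_loss(w) * dt + noise_scale * rng.standard_normal(n_chains)
    return float(np.mean(w < BOUNDARY))


flat_fraction = langevin_flat_fraction()
print(f"fraction of chains in the flat basin: {flat_fraction:.2f}")
```

Although every chain starts at the sharp minimum and both minima have the same loss, most chains should end in the flat basin. In the continuous-time limit the stationary density is proportional to exp(-L(w)/T), so each basin's mass scales with its width, which suggests a flat-basin fraction of about 5/6 for these curvatures. That is the "large volume" effect the conjecture describes, in a setting far simpler than a deep network.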

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
