LG, ML · May 15, 2021

On the Distributional Properties of Adaptive Gradients

arXiv:2105.07222v1 · 2 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a theoretical gap relevant to researchers in optimization and deep learning. It is incremental, however: it refines existing beliefs rather than introducing new methods.

The paper addresses the limited understanding of the statistical properties of adaptive gradient methods. It shows that when gradients are normally distributed, the variance of the update magnitude increases over time but remains bounded, contradicting the belief that diverging variance is why Adam needs warm-up.

Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims to provide a series of theoretical analyses of these statistical properties, supported by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.
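
The claim about the variance of the update magnitude can be probed numerically. The following is a minimal Monte Carlo sketch, not the paper's own experiment. It assumes i.i.d. standard normal gradients, an Adam-style second-moment estimate with bias correction, and no momentum (`beta1 = 0`), so the update at step t is the current gradient divided by the square root of the corrected second-moment estimate. The function name, its parameters, and the default values are illustrative assumptions.

```python
import numpy as np


def update_magnitude_variance(steps=200, runs=20000, beta2=0.999,
                              eps=1e-8, seed=0):
    """Estimate Var(|update_t|) for t = 1..steps across independent runs.

    Assumptions (not taken from the paper): gradients are i.i.d. N(0, 1),
    momentum is disabled, and the second moment is an exponential moving
    average with the usual Adam bias correction.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(runs)               # second-moment estimate, one per run
    variances = np.empty(steps)
    for t in range(1, steps + 1):
        g = rng.standard_normal(runs)            # simulated gradients
        v = beta2 * v + (1.0 - beta2) * g ** 2   # EMA of squared gradients
        v_hat = v / (1.0 - beta2 ** t)           # bias correction
        update = g / (np.sqrt(v_hat) + eps)      # momentum-free update
        variances[t - 1] = np.var(np.abs(update))
    return variances


if __name__ == "__main__":
    variances = update_magnitude_variance()
    for t in (1, 2, 5, 20, 100, 200):
        print(f"t={t:3d}  Var(|update|) ~ {variances[t - 1]:.4f}")
```

Under these assumptions the first update has magnitude close to 1 in every run, so its variance starts near zero. Printing the estimates at later steps gives a rough picture of how the variance evolves, which can be compared with the bounded growth described above.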

