ML · LG · Jul 19, 2018

Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

arXiv:1807.07540v5 · 23 citations
Originality: Incremental advance
AI Analysis

This work provides a unified Bayesian framework for adaptive and non-adaptive optimization methods in deep learning, offering incremental improvements in understanding and potentially enhancing optimizer design.

The authors tackled the problem of neural network optimization by formulating it as Bayesian filtering, which led to the development of AdaBayes, an optimizer that adaptively transitions between SGD-like and Adam-like behavior, automatically recovers AdamW, and achieves generalization performance competitive with SGD.

We formulate the problem of neural network optimization as Bayesian filtering, where the observations are the backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods, which are closely related to Bayesian inference, these methods were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead yielding a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are being optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.
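The distinction between a root-mean-square and a mean-square normalizer, and what "decoupled" weight decay means, can be illustrated with a small sketch. This is not the AdaBayes update, which the abstract does not spell out. It only contrasts the two normalizers on a single scalar parameter, and every function name and default hyperparameter below is an assumption chosen for illustration.

```python
import math


def ema_second_moment(grads, beta2=0.999):
    """Bias-corrected exponential moving average of squared gradients.

    Assumes `grads` is a non-empty sequence of floats.
    """
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    return v / (1.0 - beta2 ** len(grads))


def rms_normalized_step(g, v, lr=1e-3, eps=1e-8):
    """Adam/RMSprop-style step: divide by the root-mean-square gradient."""
    return -lr * g / (math.sqrt(v) + eps)


def ms_normalized_step(g, v, lr=1e-3, eps=1e-8):
    """Step with a mean-square normalizer (no square root)."""
    return -lr * g / (v + eps)


def adamw_like_update(w, g, v, lr=1e-3, weight_decay=1e-2, eps=1e-8):
    """Decoupled weight decay: the decay term bypasses the normalizer."""
    return w + rms_normalized_step(g, v, lr, eps) - lr * weight_decay * w


if __name__ == "__main__":
    for scale in (0.1, 10.0):
        grads = [scale] * 5  # constant gradient of a given magnitude
        v = ema_second_moment(grads)
        print(scale, rms_normalized_step(scale, v), ms_normalized_step(scale, v))
```

With a constant gradient, the root-mean-square step has magnitude close to the learning rate regardless of the gradient scale, whereas the mean-square step scales inversely with the gradient magnitude.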

Code Implementations: 1 repo
