LGMLMay 27, 2019

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

arXiv:1905.11286v348 citations
Originality Incremental advance
AI Analysis

This is an incremental improvement for deep learning practitioners, offering a more robust and memory-efficient optimizer.

The paper tackled the problem of training deep networks by proposing NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay, which performed on par or better than well-tuned SGD with momentum and Adam/AdamW in experiments on image classification, speech recognition, machine translation, and language modeling, and had two times smaller memory footprint than Adam.

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam.

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes