LG ML · May 27, 2019

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

arXiv:1905.11286v3 · 22.048 citations

Originality: Incremental advance

AI Analysis

This is an incremental improvement for deep learning practitioners: an optimizer that is robust to the choice of learning rate and weight initialization and uses less memory than Adam.

The paper proposes NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In experiments on image classification, speech recognition, machine translation, and language modeling, it performed on par with or better than well-tuned SGD with momentum and Adam/AdamW, with half the memory footprint of Adam.

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has half the memory footprint of Adam.
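To make "layer-wise gradient normalization and decoupled weight decay" concrete, here is a minimal NumPy sketch of one NovoGrad-style update. It is an illustration under stated assumptions, not the authors' reference implementation. The abstract does not spell out the update rule, so the form used here (one scalar second moment per layer, a first moment accumulated over the normalized gradient plus the weight-decay term) follows the usual description of the method. The function names, the `state` dictionary layout, and the default hyperparameters are hypothetical choices made for this example.

```python
import numpy as np


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.0):
    """Apply one NovoGrad-style update in place to a list of per-layer arrays.

    Assumed update per layer l (a sketch, not the reference code):
        v_l <- beta2 * v_l + (1 - beta2) * ||g_l||^2      (scalar per layer)
        m_l <- beta1 * m_l + g_l / (sqrt(v_l) + eps) + weight_decay * w_l
        w_l <- w_l - lr * m_l
    On the first call, v_l and m_l are initialised from the first gradient.
    Default hyperparameters are illustrative assumptions.
    """
    first_step = not state
    if first_step:
        state["m"] = [None] * len(weights)
        state["v"] = [0.0] * len(weights)

    for i, (w, g) in enumerate(zip(weights, grads)):
        g_sq = float(np.sum(g * g))  # squared gradient norm of the whole layer
        if first_step:
            v = g_sq
        else:
            v = beta2 * state["v"][i] + (1.0 - beta2) * g_sq

        # Layer-wise normalization: every element of the layer's gradient is
        # divided by the same scalar. Weight decay is added after normalizing,
        # so it is decoupled from the gradient scale.
        update = g / (np.sqrt(v) + eps) + weight_decay * w

        m = update if first_step else beta1 * state["m"][i] + update
        state["v"][i] = v
        state["m"][i] = m
        w -= lr * m  # in-place update of the layer's weights
    return weights


def optimizer_state_size(state):
    """Count stored optimizer values: one per weight (m) plus one per layer (v)."""
    return sum(m.size for m in state["m"]) + len(state["v"])


if __name__ == "__main__":
    layers = [np.ones((3, 2)), np.ones(4)]
    grads = [np.full((3, 2), 2.0), np.arange(1.0, 5.0)]
    state = {}
    novograd_step(layers, grads, state, lr=0.1)
    n_params = sum(w.size for w in layers)
    print("NovoGrad-style state values:", optimizer_state_size(state))
    print("Adam state values (m and v per weight):", 2 * n_params)
```

In this sketch the second moment is a single number per layer, while Adam keeps a second-moment entry for every weight. That difference is where the roughly halved optimizer memory footprint comes from.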
