Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
This is an incremental improvement for deep learning practitioners, offering a more robust and memory-efficient optimizer.
The paper tackled the problem of training deep networks by proposing NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay, which performed on par or better than well-tuned SGD with momentum and Adam/AdamW in experiments on image classification, speech recognition, machine translation, and language modeling, and had two times smaller memory footprint than Adam.
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam.