LG · OC · ML · Feb 20, 2020

Scalable Second Order Optimization for Deep Learning

arXiv:2002.09018v2
AI Analysis

This work targets slow optimization in deep learning. It offers practitioners a scalable implementation that narrows the gap between second-order methods in theory and their use in practice, though the contribution is incremental because it builds on existing preconditioned methods.

The paper tackles the computational inefficiency of second-order optimization methods in deep learning by presenting a scalable implementation with algorithmic and numerical improvements, achieving significant convergence and wall-clock time improvements on tasks like machine translation, language modeling, click-through rate prediction, and image classification.

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
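To make the phrase "full-matrix Adagrad" concrete, the following is a minimal NumPy sketch of the textbook full-matrix Adagrad update, not the scalable variant the paper presents. It accumulates the sum of gradient outer products and preconditions each step by the inverse square root of that matrix. The function names, the learning rate, the `eps` damping term, and the toy quadratic are all assumptions made for illustration.

```python
import numpy as np


def inverse_sqrt_psd(mat, eps=1e-8):
    """Return (mat + eps * I)^(-1/2) for a symmetric PSD matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    # Clip tiny negative eigenvalues caused by round-off, then damp with eps.
    eigvals = np.maximum(eigvals, 0.0) + eps
    # V diag(eigvals^(-1/2)) V^T
    return (eigvecs * eigvals ** -0.5) @ eigvecs.T


def full_matrix_adagrad(grad_fn, x0, lr=0.5, steps=200, eps=1e-8):
    """Textbook full-matrix Adagrad (illustrative, hypothetical helper)."""
    x = np.asarray(x0, dtype=float).copy()
    # d x d gradient statistics: this is the memory cost that makes the
    # plain method impractical for large models.
    stats = np.zeros((x.size, x.size))
    for _ in range(steps):
        g = grad_fn(x)
        stats += np.outer(g, g)
        x -= lr * inverse_sqrt_psd(stats, eps) @ g
    return x


# Toy ill-conditioned quadratic f(x) = 0.5 * x^T A x, chosen for illustration.
A = np.diag([100.0, 1.0])
x_final = full_matrix_adagrad(lambda x: A @ x, x0=[1.0, 1.0])
```

For a parameter vector of dimension d, `stats` holds d² entries and the eigendecomposition costs on the order of d³ operations per step, which illustrates the computation and memory costs the abstract describes as prohibitive.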
