LG, ML · Feb 12, 2019

Extreme Tensoring for Low-Memory Preconditioning

arXiv:1902.04620v1 · 11 citations
AI Analysis

This work addresses optimizer memory efficiency when training billion-parameter models, which are reaching hardware memory limits.

The paper proposes extreme tensoring, a low-memory form of adaptive preconditioning, to cut the memory that optimizers consume when training large models. On a large-scale NLP model, it reduces optimizer memory overhead by three orders of magnitude without degrading performance.

State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily with tensor-shaped parameters), and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance.
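
To make the memory argument concrete, here is a minimal sketch of an AdaGrad-style step with a tensor-factored diagonal preconditioner. It is an illustration under stated assumptions, not necessarily the paper's exact algorithm: the parameter vector is viewed as a tensor with shape `dims`, one accumulator vector is kept per tensor axis, and each coordinate is scaled by the product of its per-axis factors raised to the power -1/(2p), where p is the number of axes. The names `diag_memory`, `et_memory`, and `et_adagrad_step` are hypothetical.

```python
import itertools
import math


def diag_memory(dims):
    """Accumulator entries for a full diagonal preconditioner (one per parameter)."""
    return math.prod(dims)


def et_memory(dims):
    """Accumulator entries for the factored sketch (one vector per tensor axis)."""
    return sum(dims)


def et_adagrad_step(params, grads, accs, dims, lr=0.1, eps=1e-8):
    """One AdaGrad-style step with a tensor-factored diagonal preconditioner.

    params, grads: flat lists of length prod(dims), in row-major order.
    accs: per-axis accumulators, accs[i] has length dims[i]; updated in place.
    Assumption: the exponent -1/(2p) and the eps placement are illustrative.
    """
    p = len(dims)
    # Multi-indices in row-major order (last axis varies fastest).
    indices = list(itertools.product(*(range(d) for d in dims)))

    # Each axis accumulates squared gradients summed over its slices.
    for flat, idx in enumerate(indices):
        g2 = grads[flat] ** 2
        for axis, j in enumerate(idx):
            accs[axis][j] += g2

    # Scale each coordinate by the product of its per-axis factors.
    new_params = []
    for flat, idx in enumerate(indices):
        scale = 1.0
        for axis, j in enumerate(idx):
            scale *= (eps + accs[axis][j]) ** (-1.0 / (2 * p))
        new_params.append(params[flat] - lr * scale * grads[flat])
    return new_params
```

With a single axis (p = 1) the sketch reduces to ordinary diagonal AdaGrad. For a parameter with 1,048,576 entries, the full diagonal needs 1,048,576 accumulator entries, the shape (1024, 1024) needs 2,048, and the shape (32, 32, 32, 32) needs only 128, which shows how finer tensoring trades preconditioner expressivity for memory.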

