PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

arXiv:2605.10335
AI Analysis

For practitioners training large-scale neural networks, PowerStep offers a principled way to reduce optimizer memory overhead without sacrificing performance.

PowerStep is a memory-efficient adaptive optimizer that achieves coordinate-wise adaptivity via ℓ_p-norm steepest descent, without storing second-moment statistics. On Transformer models from 124M to 235B parameters it matches Adam's convergence speed while halving optimizer memory, and combined with int8 quantization it reduces optimizer memory by ~8× relative to full-precision Adam.
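The halving and ~8× figures are consistent with simple byte counting. The sketch below is a back-of-the-envelope calculation, not a measurement from the paper: it assumes Adam keeps two fp32 state buffers per parameter, that a second-moment-free optimizer keeps one, and that int8 storage costs one byte per value with any per-block scale factors ignored. The helper name `optimizer_state_bytes` is hypothetical.

```python
# Rough optimizer-state accounting under stated assumptions:
#   - Adam: two fp32 buffers per parameter (first and second moment)
#   - momentum-only optimizer: one buffer per parameter
#   - int8: one byte per value; quantization scale overhead is ignored
# None of these byte widths are taken from the paper's measurements.

def optimizer_state_bytes(num_params: int, buffers: int, bytes_per_value: int) -> int:
    """Bytes of optimizer state for `buffers` state tensors per parameter."""
    return num_params * buffers * bytes_per_value


NUM_PARAMS = 235_000_000_000  # largest model size quoted in the summary

adam_fp32 = optimizer_state_bytes(NUM_PARAMS, buffers=2, bytes_per_value=4)
momentum_fp32 = optimizer_state_bytes(NUM_PARAMS, buffers=1, bytes_per_value=4)
momentum_int8 = optimizer_state_bytes(NUM_PARAMS, buffers=1, bytes_per_value=1)

ratio_fp32 = adam_fp32 / momentum_fp32  # one buffer instead of two
ratio_int8 = adam_fp32 / momentum_int8  # one int8 buffer instead of two fp32

if __name__ == "__main__":
    gib = 2**30
    print(f"Adam fp32 state:      {adam_fp32 / gib:,.0f} GiB")
    print(f"Momentum fp32 state:  {momentum_fp32 / gib:,.0f} GiB ({ratio_fp32:.0f}x smaller)")
    print(f"Momentum int8 state:  {momentum_int8 / gib:,.0f} GiB ({ratio_int8:.0f}x smaller)")
```

In practice int8 schemes store extra scale factors, which is one reason a reported reduction would be approximately, rather than exactly, 8×.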

Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive \texttt{int8} quantization, PowerStep remains numerically stable and reduces optimizer memory by $\sim\!8\times$ compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
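To make the idea of "a nonlinear transform applied to a momentum buffer" concrete, here is a minimal dependency-free sketch of textbook ℓ_p-norm steepest descent applied to an exponential moving average of gradients. It is an illustration under assumptions, not PowerStep's actual update rule: the function names, the normalization by the dual norm, the default `p`, and the hyperparameters are all hypothetical, and the repository linked above is the authority on what the method really does. The closed form used is the standard one: with dual exponent q = p / (p - 1), the unit-ℓ_p direction maximizing the inner product with g is d_i = sign(g_i) |g_i|^(q-1) / ||g||_q^(q-1).

```python
import math


def lp_steepest_direction(g, p):
    """Direction d with ||d||_p = 1 maximizing <g, d>, for 1 < p < infinity.

    Standard closed form with dual exponent q = p / (p - 1):
        d_i = sign(g_i) * |g_i|**(q - 1) / ||g||_q**(q - 1)
    p = 2 gives the normalized gradient; large p approaches sign(g).
    """
    q = p / (p - 1.0)
    g_q = sum(abs(x) ** q for x in g) ** (1.0 / q)
    if g_q == 0.0:
        return [0.0 for _ in g]
    scale = g_q ** (q - 1.0)
    return [math.copysign(abs(x) ** (q - 1.0), x) / scale for x in g]


def momentum_lp_step(params, grads, momentum, lr=0.1, beta=0.9, p=4.0):
    """One illustrative step (assumed form, not the paper's exact rule).

    Keeps a single state buffer (the momentum EMA) and applies the
    coordinate-wise power transform to it; no second moment is stored.
    """
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    direction = lp_steepest_direction(new_momentum, p)
    new_params = [x - lr * d for x, d in zip(params, direction)]
    return new_params, new_momentum


if __name__ == "__main__":
    g = [1.0, -100.0]
    for p in (2.0, 4.0, 100.0):
        print(p, lp_steepest_direction(g, p))
```

As `p` grows, the exponent q - 1 shrinks toward zero, so coordinates with very different gradient magnitudes receive steps of nearly equal size. That per-coordinate rescaling is the kind of adaptivity the abstract refers to, obtained here without any second-moment buffer.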

