LGAIMLOct 24, 2025

A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

arXiv:2510.21314v16 citationsh-index: 1
Originality Highly original
AI Analysis

This provides foundational theoretical insights for researchers and practitioners in machine learning, particularly those scaling large language models, though it is incremental in extending existing convergence theories to hardware-aware quantization.

The paper tackles the lack of theoretical understanding for why low-precision training with adaptive optimizers like Adam and Muon remains effective, by introducing the first framework to analyze their convergence under floating-point quantization, showing they retain rates close to full-precision versions with logarithmic mantissa scaling.

The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on $β_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes