LG | AI | Feb 4, 2025

Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

arXiv:2502.02431v1 | 8 citations | h-index: 96 | Has Code
Originality: Incremental advance
AI Analysis

This work provides insights for researchers in deep learning optimization by linking theoretical acceleration methods with practical algorithms, though it is incremental in nature.

The paper connects Schedule-Free optimizers and AdEMAMix with accelerated SGD variants and reports that AdEMAMix performs best in preliminary experiments on a 150M-parameter language modeling task. It also introduces Simplified-AdEMAMix, which matches AdEMAMix's performance while using a single momentum term instead of two.

Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M-parameter language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at https://github.com/DepenM/Simplified-AdEMAMix/.
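
To make the phrase "decoupling the momentum coefficient from the current gradient's weight" concrete, here is a minimal sketch for plain SGD on a toy quadratic. It is an illustration only, not the paper's Simplified-AdEMAMix implementation (see the linked repository for that). The function names, the objective, and all hyperparameter values are assumptions chosen for the example.

```python
def decoupled_momentum_step(theta, m, grad, lr=0.01, beta=0.9, alpha=0.0):
    """One SGD step in which the fresh gradient has its own weight `alpha`.

    Hypothetical helper for illustration:
      m     accumulates past gradients with decay `beta`
      alpha sets how much the current gradient adds on top of `m`,
            independently of `beta`
    With alpha = 0 this reduces to ordinary (unnormalized) heavy-ball momentum.
    """
    m = [beta * mi + gi for mi, gi in zip(m, grad)]
    theta = [ti - lr * (alpha * gi + mi) for ti, gi, mi in zip(theta, grad, m)]
    return theta, m


def run(alpha, steps=200):
    """Minimize the toy objective 0.5 * ||theta||^2, whose gradient is theta."""
    theta, m = [1.0, -2.0], [0.0, 0.0]
    for _ in range(steps):
        grad = list(theta)
        theta, m = decoupled_momentum_step(theta, m, grad, alpha=alpha)
    return theta


if __name__ == "__main__":
    print("alpha=0 (plain momentum):", run(alpha=0.0))
    print("alpha=5 (extra weight on current gradient):", run(alpha=5.0))
```

In this sketch `beta` controls only how long past gradients are remembered, while `alpha` controls only the contribution of the newest gradient. A standard exponential moving average ties both to a single coefficient.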
