LG · May 11

Optimistic Dual Averaging Unifies Modern Optimizers

arXiv:2605.11172 · 61.6
Predicted impact: top 35% in LG · last 90 days · Originality: Incremental advance
AI Analysis

Provides a theoretical unification of modern deep learning optimizers and a practical wrapper that removes the need to tune weight decay.

SODA unifies modern optimizers (Muon, Lion, AdEMAMix, NAdam) under optimistic dual averaging and introduces a wrapper that eliminates weight decay tuning via a 1/k decay schedule, consistently improving performance without extra hyperparameter tuning across various scales.

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix, and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
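
To make the $1/k$ decay schedule concrete, here is a minimal sketch of one possible reading, not the paper's actual wrapper. It assumes the schedule sets a decoupled (AdamW-style) weight decay coefficient to 1/k at step k, counted from 1. The names `decay_coefficient` and `wrapped_step`, the learning rate `lr`, and the exact update form are all hypothetical; the abstract does not specify them.

```python
from typing import List


def decay_coefficient(k: int) -> float:
    """Assumed schedule: the weight decay coefficient at step k (1-indexed) is 1/k."""
    if k < 1:
        raise ValueError("step index k must be >= 1")
    return 1.0 / k


def wrapped_step(
    weights: List[float],
    base_update: List[float],
    lr: float,
    k: int,
) -> List[float]:
    """Apply one step of a base optimizer's update plus decoupled 1/k decay.

    `base_update` stands in for whatever direction the base optimizer
    (e.g. a momentum or sign-based rule) produced. The decoupled form
    w - lr * lam * w - lr * u is an assumption made for illustration.
    """
    lam = decay_coefficient(k)
    return [w - lr * lam * w - lr * u for w, u in zip(weights, base_update)]


if __name__ == "__main__":
    w = [1.0, -2.0]
    for step in range(1, 4):
        # A zero base update isolates the effect of the decaying coefficient.
        w = wrapped_step(w, [0.0, 0.0], lr=0.1, k=step)
        print(step, decay_coefficient(step), w)
```

Under this reading the decay is strongest at the start of training and shrinks as k grows, so no separate weight decay value has to be chosen by hand.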

