AI · LG · OC · May 11

Optimizer-Induced Mode Connectivity: From AdamW to Muon

arXiv:2605.09991 · 15.4
Predicted impact: top 61% in AI · last 90 days · Originality: Incremental advance
AI Analysis

The paper shows that the choice of optimizer induces implicit regularization that structures the loss landscape, offering a new perspective for understanding and comparing optimization algorithms in deep learning.

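This paper investigates how the choice of optimizer (AdamW, Muon, and others in the Lion-$\mathcal{K}$ family) affects the connectivity of solutions in the loss landscape. For two-layer ReLU networks at sufficiently large width, the solutions reached by a single optimizer form a connected set. At large width, the regions induced by two different optimizers can be disjoint or overlap depending on regularization, and in a small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. In GPT-2 pretraining experiments, same-optimizer paths preserve each model's spectrum, while cross-optimizer paths show a smooth transition.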

Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width, two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe that same-optimizer paths preserve each model's spectrum, while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond the classical mode connectivity literature.
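To make the notion of a loss barrier between zero-loss solutions concrete, here is a minimal sketch for a two-layer ReLU network. It is not the paper's construction and involves no AdamW or Muon training. As an illustrative assumption, the two endpoints are a random "teacher" network and a copy with its hidden units permuted, which computes the same function and therefore also has zero loss. The helper names (`forward`, `mse`, `interpolate`, `linear_path_barrier`) and the barrier definition used here (the maximum excess of the path loss over the linear interpolation of the endpoint losses) are assumptions chosen for this example; the paper may define its barrier differently.

```python
import random

def forward(params, x):
    """Two-layer ReLU network: sum_i a_i * relu(w_i . x)."""
    W, a = params
    return sum(a_i * max(0.0, sum(w * v for w, v in zip(w_i, x)))
               for w_i, a_i in zip(W, a))

def mse(params, X, y):
    """Mean squared error of the network on the dataset (X, y)."""
    return sum((forward(params, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

def interpolate(pa, pb, t):
    """Point at position t on the straight line between two parameter sets."""
    (Wa, aa), (Wb, ab) = pa, pb
    W = [[(1 - t) * u + t * v for u, v in zip(ra, rb)] for ra, rb in zip(Wa, Wb)]
    a = [(1 - t) * u + t * v for u, v in zip(aa, ab)]
    return W, a

def linear_path_barrier(pa, pb, X, y, num_points=21):
    """Max excess of the path loss over the interpolated endpoint losses."""
    ts = [k / (num_points - 1) for k in range(num_points)]
    losses = [mse(interpolate(pa, pb, t), X, y) for t in ts]
    excess = [loss - ((1 - t) * losses[0] + t * losses[-1])
              for t, loss in zip(ts, losses)]
    return max(excess), losses

# Hypothetical setup: input dimension d, hidden width m, n samples.
rng = random.Random(0)
d, m, n = 4, 8, 200
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
teacher = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)],
           [rng.gauss(0, 1) for _ in range(m)])
y = [forward(teacher, x) for x in X]  # labels the teacher fits exactly

# Cyclically permute the hidden units: same function, different parameters.
perm = list(range(1, m)) + [0]
permuted = ([teacher[0][i] for i in perm], [teacher[1][i] for i in perm])

barrier, losses = linear_path_barrier(teacher, permuted, X, y)
print(f"endpoint losses: {losses[0]:.2e}, {losses[-1]:.2e}")
print(f"barrier on the straight path: {barrier:.4f}")
```

A positive barrier on the straight path does not by itself show that two solutions are disconnected, since mode connectivity asks whether any low-loss path exists. The disconnection result quoted above is a statement about all paths between the AdamW and Muon components, which a single-path check like this one cannot establish.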
