Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
This work identifies a subtle failure mode in LLM training: coupling between the normalization layer and the optimizer can degrade loss without producing obvious symptoms such as NaNs or divergence, so researchers and practitioners can easily overlook it.
The paper investigates how normalization layers and optimizers interact in LLM training. It finds that Dynamic Erf (Derf) suffers a significant performance penalty when paired with the Muon optimizer: its gap to RMSNorm grows from +0.31 nats under AdamW to +0.97 nats under Muon. Two fixes, an EMA-blend and a reduced alpha, each recover most of that gap.
In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial (three normalizers, two optimizers) at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 nats under Muon, roughly a threefold increase. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from its published default of 0.5 (Chen & Liu, 2025) to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale. Using Derf's published default alpha with Muon incurs a 0.66-nat interaction penalty (0.97 - 0.31) without producing NaNs or divergence, making the failure easy to miss in short pilot runs.
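The arithmetic behind the interaction penalty, and the two erf failure modes named above, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the normalizer is reduced to a bare `erf(alpha * x)` with no learnable shift, gain, or bias; the input values are arbitrary probes rather than measured activations; and the helper names (`interaction_penalty`, `erf_response`, `relative_scale_kept`) are hypothetical. Only the two gap values and the two alpha settings come from the abstract.

```python
import math

# Gaps to RMSNorm in nats, as reported in the abstract.
DERF_GAP_ADAMW = 0.31
DERF_GAP_MUON = 0.97


def interaction_penalty(gap_base: float, gap_new: float) -> float:
    """Difference of gaps: the extra loss attributable to the pairing itself."""
    return gap_new - gap_base


def erf_response(x: float, alpha: float) -> float:
    """Bare erf(alpha * x). Assumption: learnable parameters are omitted."""
    return math.erf(alpha * x)


def relative_scale_kept(x_small: float, x_large: float, alpha: float) -> float:
    """Output ratio divided by input ratio; 1.0 means scale is fully preserved."""
    out_ratio = erf_response(x_large, alpha) / erf_response(x_small, alpha)
    return out_ratio / (x_large / x_small)


if __name__ == "__main__":
    penalty = interaction_penalty(DERF_GAP_ADAMW, DERF_GAP_MUON)
    print(f"interaction penalty: {penalty:.2f} nats")
    print(f"gap ratio Muon/AdamW: {DERF_GAP_MUON / DERF_GAP_ADAMW:.2f}x")

    # Probe inputs are illustrative, not measured activation magnitudes.
    for alpha in (0.5, 0.3):
        print(f"alpha = {alpha}")
        for x in (0.5, 2.0, 8.0):
            print(f"  x = {x:4.1f}  erf(alpha * x) = {erf_response(x, alpha):.4f}")
        kept = relative_scale_kept(1.0, 2.0, alpha)
        print(f"  fraction of a 2x input ratio preserved: {kept:.3f}")
```

In this toy setting, large inputs map to outputs that are nearly indistinguishable (saturation and scale blindness), while the smaller alpha keeps more of the ratio between two inputs, which is consistent with the near-linear-regime explanation given above. It does not model Muon's spectral-norm growth or the EMA-blend.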