On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
This work addresses the optimization bottleneck in training large language models by introducing Magma, a simple drop-in replacement for adaptive optimizers that delivers consistent gains with negligible computational overhead.
The paper shows that randomly masking parameter updates in adaptive optimizers can be highly effective for training large language models: a masked variant of RMSProp consistently outperforms recent state-of-the-art optimizers. At the 1B model size, the proposed Magma optimizer reduces perplexity by over 19% compared to Adam and by over 9% compared to Muon.
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
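To make the idea of a masked RMSProp update concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the mask granularity, the keep probability, or how optimizer state is handled for masked entries, so the per-coordinate Bernoulli mask, the `keep_prob` parameter, the choice to always refresh the second-moment estimate, and the function name `masked_rmsprop_step` are all illustrative assumptions. The momentum-gradient alignment used by Magma is not described in enough detail here to sketch, so it is left out.

```python
import math
import random


def masked_rmsprop_step(params, grads, sq_avg, lr=1e-3, beta=0.99, eps=1e-8,
                        keep_prob=0.5, rng=random):
    """One RMSProp step with a random per-coordinate update mask.

    Assumption (not from the source): each coordinate's update is kept
    independently with probability keep_prob and skipped otherwise.
    """
    new_params, new_sq_avg = [], []
    for p, g, v in zip(params, grads, sq_avg):
        # Assumption: the second-moment estimate is refreshed for every
        # coordinate, including those whose update is masked out.
        v = beta * v + (1.0 - beta) * g * g
        update = lr * g / (math.sqrt(v) + eps)
        # Bernoulli mask: apply the update only when the draw succeeds.
        keep = rng.random() < keep_prob
        new_params.append(p - update if keep else p)
        new_sq_avg.append(v)
    return new_params, new_sq_avg


# Toy usage: minimize f(x) = sum(x_i ** 2), whose gradient is 2 * x.
rng = random.Random(0)
x = [1.0, -2.0, 3.0]
v = [0.0, 0.0, 0.0]
for _ in range(2000):
    g = [2.0 * xi for xi in x]
    x, v = masked_rmsprop_step(x, g, v, lr=1e-2, keep_prob=0.5, rng=rng)
```

With `keep_prob=1.0` the sketch reduces to plain RMSProp, and with `keep_prob=0.0` the parameters never move, which makes the role of the mask easy to isolate on a toy problem before trying it at scale.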