LG | OC | ML | Oct 14, 2025

Cautious Weight Decay

arXiv:2510.12402v1 | 11 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses optimization efficiency for machine learning practitioners, offering an incremental improvement to existing optimizers such as AdamW.

The paper introduces Cautious Weight Decay (CWD), a simple modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. The authors report that it consistently improves final loss and accuracy in language model pre-training and ImageNet classification at million- to billion-parameter scales.

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
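The sign-alignment rule described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the convention that `update` is the quantity subtracted from the parameters (for example, an Adam-style direction), so that decay is applied only where `params * update > 0`. The function names, the strict inequality, and the toy values are assumptions made for this example.

```python
import numpy as np


def decoupled_decay_step(params, update, lr, weight_decay):
    """Standard decoupled decay: every coordinate is shrunk toward zero."""
    return params - lr * (update + weight_decay * params)


def cautious_decay_step(params, update, lr, weight_decay):
    """Hypothetical CWD step: decay only coordinates whose sign matches the update.

    Assumes `update` is subtracted from `params`, so a matching sign means the
    optimizer step is already moving that coordinate toward zero.
    """
    # 1.0 where parameter and update share a sign, 0.0 elsewhere (assumed strict test).
    mask = (params * update > 0).astype(params.dtype)
    return params - lr * (update + weight_decay * mask * params)


if __name__ == "__main__":
    params = np.array([1.0, -2.0, 3.0, -4.0])
    update = np.array([0.5, 0.5, -0.5, -0.5])  # toy optimizer direction
    # Only the first and last coordinates have matching signs, so only they decay.
    print(cautious_decay_step(params, update, lr=0.1, weight_decay=0.1))
    print(decoupled_decay_step(params, update, lr=0.1, weight_decay=0.1))
```

Under these assumptions the change from standard decoupled decay is the single masking line, which matches the abstract's description of a one-line modification that adds no hyperparameters.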

