LG OC ML · Oct 14, 2025

Cautious Weight Decay

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

arXiv:2510.12402v118.811 citations · h-index: 6

Originality: Incremental advance

AI Analysis

This work targets optimization efficiency for machine learning practitioners, offering an incremental improvement to existing optimizers such as AdamW.

The paper revisits weight decay in optimization and introduces Cautious Weight Decay (CWD), a simple modification that applies decay only to parameter coordinates whose signs align with the optimizer update. The authors report that it consistently improves final loss and accuracy in language model pre-training and ImageNet classification at million- to billion-parameter scales.

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
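
The sign-masking rule described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `cwd_step` and `standard_decay_step` are hypothetical, the update `u` is assumed to be the quantity subtracted from the parameters (so the step is `x - lr * u`), and "signs align" is assumed to mean `u * x > 0`, with ties (a zero update or zero parameter) left undecayed.

```python
def standard_decay_step(params, updates, lr, weight_decay):
    """Decoupled weight decay: every coordinate is shrunk toward zero."""
    return [x - lr * (u + weight_decay * x) for x, u in zip(params, updates)]


def cwd_step(params, updates, lr, weight_decay):
    """Sign-masked decay (illustrative): decay only where u and x share a sign.

    Assumption: `updates` holds the optimizer's update direction u, applied
    as x <- x - lr * u. When u * x > 0, the optimizer step already moves x
    toward zero, so decay is applied; otherwise it is skipped.
    """
    new_params = []
    for x, u in zip(params, updates):
        mask = 1.0 if u * x > 0 else 0.0  # assumed tie-breaking: no decay at 0
        new_params.append(x - lr * (u + weight_decay * mask * x))
    return new_params


if __name__ == "__main__":
    params = [1.0, -2.0, 3.0]
    updates = [0.5, 0.5, -1.0]  # only the first coordinate has u * x > 0
    print("standard:", standard_decay_step(params, updates, lr=0.1, weight_decay=0.1))
    print("cautious:", cwd_step(params, updates, lr=0.1, weight_decay=0.1))
```

In this toy example only the first coordinate is decayed under the masked rule, while the standard rule decays all three. In a real optimizer the update `u` would come from AdamW, Lion, or Muon, and the mask would be computed elementwise on tensors.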
