Does Weight Decay Enhance Training Stability?

Marius Saether, Amir Kolic, Tomaso Poggio, Pierfrancesco Beneventano

arXiv:2605.1662249.7

AI Analysis

Provides mechanistic understanding of weight decay's stabilizing effect for deep learning practitioners, revealing limitations of existing stability diagnostics.

Weight decay stabilizes training by slowing progressive sharpening and by inducing an architecture-dependent phase transition in sharpness dynamics; these effects carry over to function-space stability.

In modern deep learning, weight decay is often credited with "stabilizing" training dynamics, a departure from its classical role as a static regularization penalty. We investigate a fundamental question: *does weight decay stabilize training dynamics, and if so, through which mechanism?* Training stability is understood through several different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the *Edge of Stability* (EoS). We show that weight decay robustly slows *progressive sharpening*. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical $\frac{2}{\eta}$ boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Finally, these results show that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.
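
As a point of reference for the quantities named above, the following is a minimal sketch of how sharpness (the largest Hessian eigenvalue of the loss) might be estimated and compared against the $\frac{2}{\eta}$ threshold. It is not the authors' code: the function name `estimate_sharpness`, the finite-difference Hessian-vector product, the learning rate, and the toy quadratic loss are all assumptions made for illustration. Power iteration recovers the largest eigenvalue only when that eigenvalue also dominates in magnitude.

```python
import numpy as np


def estimate_sharpness(grad_fn, theta, iters=200, eps=1e-4, seed=0):
    """Estimate the dominant Hessian eigenvalue at theta by power iteration.

    grad_fn is a hypothetical callable returning the loss gradient at a point.
    Hessian-vector products are approximated by central finite differences.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        # Approximate H @ v without forming the Hessian explicitly.
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)
        lam = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam


# Toy quadratic loss 0.5 * theta^T H theta (illustrative only).
H = np.diag([1.0, 3.0, 10.0])
grad = lambda theta: H @ theta
theta0 = np.ones(3)

eta = 0.1  # assumed learning rate
sharpness = estimate_sharpness(grad, theta0)
below_threshold = sharpness < 2.0 / eta  # classical quadratic stability check
```

On a quadratic, gradient descent with step size $\eta$ is stable along an eigendirection of curvature $\lambda$ exactly when $\lambda < \frac{2}{\eta}$. The abstract's closing claim is that this kind of threshold can be an unreliable diagnostic once weight decay is present, so a check like `below_threshold` should be read as a heuristic only.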
