LG · AI · Aug 2, 2025

Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

arXiv:2508.01483v1 · 9 citations · h-index: 7 · Trans. Mach. Learn. Res.
Originality: Synthesis-oriented
AI Analysis

This work provides incremental insights into optimizing learning rate scheduling for transformer training, benefiting practitioners in deep learning.

The paper analyzes the cooldown phase in the Warmup-Stable-Decay learning rate scheduler for transformers, revealing that different cooldown shapes affect model bias-variance trade-offs and performance, with improvements from higher AdamW β2 values and empirical support for the river valley loss perspective.

Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes expose a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations, comparable to those from cooldown shape selection, when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
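To make the three WSD phases and the notion of a "cooldown shape" concrete, here is a minimal sketch of such a schedule. The function name `wsd_lr`, its parameters, and the three shapes shown (linear, cosine, square-root) are illustrative assumptions, not necessarily the shapes or the parameterization studied in the paper. It assumes `warmup_steps` and `cooldown_steps` are both positive.

```python
import math


def wsd_lr(step, total_steps, warmup_steps, cooldown_steps, peak_lr, shape="linear"):
    """Learning rate at `step` for a warmup-stable-decay schedule (illustrative sketch)."""
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Warmup: linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # Stable phase: constant learning rate.
        return peak_lr
    # Cooldown: t runs from 0 (start of cooldown) to 1 (end of training).
    t = min(1.0, (step - cooldown_start) / cooldown_steps)
    if shape == "linear":
        factor = 1.0 - t
    elif shape == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * t))
    elif shape == "sqrt":
        # Assumed shape: drops quickly at first, then flattens.
        factor = 1.0 - math.sqrt(t)
    else:
        raise ValueError(f"unknown cooldown shape: {shape}")
    return peak_lr * factor


if __name__ == "__main__":
    # Compare the shapes halfway through a 200-step cooldown.
    for shape in ("linear", "cosine", "sqrt"):
        print(shape, wsd_lr(900, 1000, 100, 200, 1e-3, shape))
```

All three shapes start the cooldown at `peak_lr` and end at zero; they differ only in how quickly the learning rate falls in between, which is the degree of freedom the abstract refers to when it compares cooldown shapes.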
