LG | AI | Dec 5, 2025

Scaling and Transferability of Annealing Strategies in Large Language Model Training

arXiv:2512.13705v1 | 1 citation
Originality: Incremental advance
AI Analysis

The paper provides practical guidance for efficiently training large language models, though its contribution is incremental: it refines existing scheduling approaches.

The paper tackles the challenge of optimizing learning rate annealing strategies for large language models by developing a predictive framework that enables transfer of optimal annealing ratios from smaller to larger models, eliminating the need for exhaustive hyperparameter searches.

Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments on both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
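
For readers unfamiliar with the WSD scheduler named in the abstract, the following is a minimal sketch of such a schedule, not the paper's implementation. It assumes a linear warmup, a constant plateau at the maximum learning rate, and a linear decay. The abstract does not specify the decay shape, so the linear form is an assumption. The function name `wsd_lr` and its parameters are hypothetical. Here `anneal_ratio` is taken to mean the fraction of total training steps spent in the decay phase, which may differ from the paper's exact definition.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps, anneal_ratio, min_lr=0.0):
    """Learning rate at a 0-indexed `step` under a Warmup-Steady-Decay schedule.

    Assumptions (not taken from the paper): linear warmup, linear decay,
    and `anneal_ratio` = fraction of `total_steps` spent decaying.
    """
    if not 0.0 <= anneal_ratio <= 1.0:
        raise ValueError("anneal_ratio must lie in [0, 1]")

    decay_steps = int(round(anneal_ratio * total_steps))
    decay_start = total_steps - decay_steps
    if warmup_steps > decay_start:
        raise ValueError("warmup and decay phases overlap")

    # Warmup: ramp linearly up to max_lr over `warmup_steps` steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps

    # Steady: hold the maximum learning rate.
    if step < decay_start:
        return max_lr

    # Decay (annealing): interpolate linearly from max_lr toward min_lr.
    progress = min(1.0, (step - decay_start) / max(decay_steps, 1))
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    # Illustrative values only: 1000 steps, 100 warmup steps, 20% annealing.
    for s in (0, 99, 500, 800, 900, 999):
        print(s, wsd_lr(s, total_steps=1000, max_lr=3e-4,
                        warmup_steps=100, anneal_ratio=0.2))
```

In this parameterization, the three quantities the abstract mentions (training steps, maximum learning rate, and annealing behavior) appear as `total_steps`, `max_lr`, and `anneal_ratio`. Under the paper's claim, a value of `anneal_ratio` tuned on a small model would be a candidate to reuse when training a larger one.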

