CL · LG · Jul 23, 2025

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

arXiv:2507.17634v2 · 12 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses learning rate scheduling for large language model pre-training. It offers a method that improves training efficiency and performance, though the advance is incremental because it builds on existing model merging techniques.

The paper tackles the problem of learning rate scheduling in LLM pre-training by proposing WSM, a decay-free framework that connects learning rate decay to model merging, achieving performance improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro.

Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay, and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
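
The link between decay and merging described in the abstract can be illustrated with a small numerical sketch. It is not the paper's implementation. It assumes checkpoints are saved along a constant-LR trajectory and that each checkpoint differs from the previous one by a recorded update vector. Under that assumption, a weighted average of the checkpoints equals the starting point minus the same updates scaled by decaying coefficients. The function names (`merge_weights_from_decay`, `linear_decay`, `simulate`) and the toy random updates are hypothetical and chosen only for illustration.

```python
import random


def merge_weights_from_decay(w):
    """Convert decay coefficients w_1..w_k into merge weights c_0..c_k.

    Assumes w is non-increasing with values in [0, 1]. The weights satisfy
    w_i = c_i + c_{i+1} + ... + c_k and sum to 1.
    """
    k = len(w)
    c = [1.0 - w[0]]                    # c_0 = 1 - w_1
    for j in range(1, k):
        c.append(w[j - 1] - w[j])       # c_j = w_j - w_{j+1}
    c.append(w[k - 1])                  # c_k = w_k
    return c


def linear_decay(k):
    """Illustrative linear decay coefficients: w_i = (k - i + 1) / (k + 1)."""
    return [(k - i + 1) / (k + 1) for i in range(1, k + 1)]


def simulate(k=8, dim=4, seed=0):
    """Compare checkpoint merging with decayed updates on toy parameters."""
    rng = random.Random(seed)
    theta = [rng.gauss(0, 1) for _ in range(dim)]
    # Toy update vectors standing in for constant-LR optimizer steps.
    updates = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(k)]

    # Checkpoints along the constant-LR trajectory: theta_0 .. theta_k.
    checkpoints = [theta[:]]
    for g in updates:
        checkpoints.append([p - u for p, u in zip(checkpoints[-1], g)])

    w = linear_decay(k)
    c = merge_weights_from_decay(w)

    # Weighted average of the saved checkpoints.
    merged = [sum(cj * ck[d] for cj, ck in zip(c, checkpoints))
              for d in range(dim)]
    # Same updates applied to theta_0 with decaying coefficients.
    decayed = [theta[d] - sum(wi * g[d] for wi, g in zip(w, updates))
               for d in range(dim)]
    return c, merged, decayed


c, merged, decayed = simulate()
print("merge weights:", c)
print("max difference:", max(abs(m - d) for m, d in zip(merged, decayed)))
```

In this toy setting, the linear coefficients map to a uniform average over all saved checkpoints. The equivalence holds only for the recorded updates: a real decay phase would change the gradients the optimizer sees, so merging emulates decay and does not reproduce it exactly.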

