ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

arXiv:2605.1909576.31 citations

AI Analysis

For practitioners training large language models, this work provides a learning-rate-free and schedule-free method that outperforms current state-of-the-art schedules, reducing the need for hyperparameter tuning.

Schedule-Free Learning was scaled to large language models by identifying the fixes needed for larger batch sizes and model sizes. The resulting method, ScheduleFree+, outperforms Warmup-Stable-Decay (WSD) schedules, with a reported 31% improvement over state-of-the-art schedules at 1000 tokens per parameter.

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, with success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long-duration training: at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
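
To make the link between Schedule-Free Learning and model averaging concrete, here is a minimal sketch of the base Schedule-Free SGD update as commonly described in earlier Schedule-Free work. It is not ScheduleFree+: the fixes for large batch and model sizes and the learning-rate-free component are not specified in the abstract and are not reproduced here. The function names, the toy quadratic objective, and the values of `lr` and `beta` are assumptions chosen for illustration.

```python
def schedule_free_sgd(grad_fn, w0, lr=0.1, beta=0.9, steps=1000):
    """Base Schedule-Free SGD sketch (illustrative, not ScheduleFree+).

    z: the plain SGD iterate
    x: a uniform running average of the z iterates (used for evaluation)
    y: an interpolation of z and x, where the gradient is evaluated
    """
    z = list(w0)
    x = list(w0)
    for t in range(1, steps + 1):
        # Gradient is taken at a point between the SGD iterate and its average.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        # Constant step size: no warmup or decay schedule is used.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Online uniform average, so x is usable at any stopping time.
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


def quadratic_grad(w):
    # Hypothetical toy objective: sum((w_i - 3)^2), minimized at w_i = 3.
    return [2.0 * (wi - 3.0) for wi in w]


x_final = schedule_free_sgd(quadratic_grad, [0.0, 10.0])
```

The averaged sequence `x` plays the role that a merged or averaged checkpoint would play in ordinary training, which is why the method is described as an "anytime" one: training can stop at any step and `x` is the model to evaluate.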
