CL · Jun 4

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

Yongwei Zhou, Juncheng Diao, Junlin Shang, Peiguang Li, Rongxiang Weng

arXiv:2606.05610 · 85.6

Predicted impact top 36% in CL · last 90 days
Originality Incremental advance

AI Analysis

For practitioners running continued pre-training of LLMs, this provides a principled method to avoid costly hyperparameter searches, though the advance is incremental because it builds on existing scaling-law concepts.

The paper discovers predictable scaling laws for optimal hyperparameters (learning rate, batch size) in LLM continued pre-training and proposes a framework to predict them from compute budget, reducing search overhead by up to 90% while maintaining performance.

The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute, that is, the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
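The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional forms, so everything concrete here is an assumption: a saturating power law L(C) = E + A * C^(-alpha) for the loss-compute relationship, simple power laws in compute for the optimal learning rate and batch size, and adding the equivalent compute to the planned budget as the way of "combining" them. All function names and coefficient values are hypothetical placeholders, not figures from the paper.

```python
def loss_from_compute(compute, a, alpha, e):
    """Assumed loss-compute law: L(C) = e + a * C**(-alpha)."""
    return e + a * compute ** (-alpha)


def equivalent_compute(loss, a, alpha, e):
    """Invert the assumed law: the compute that would reach `loss` from scratch."""
    if loss <= e:
        raise ValueError("loss must be above the irreducible term e")
    return (a / (loss - e)) ** (1.0 / alpha)


def power_law(compute, coef, exponent):
    """Assumed hyperparameter law: h(C) = coef * C**exponent."""
    return coef * compute ** exponent


def predict_hyperparameters(ckpt_loss, budget, loss_law, lr_law, bs_law):
    """Stage 2 sketch: map a checkpoint's validation loss plus a planned
    budget to predicted hyperparameters.

    Assumption: the laws are evaluated at equivalent compute + budget;
    the paper may combine the two quantities differently.
    """
    c_eq = equivalent_compute(ckpt_loss, **loss_law)
    c_total = c_eq + budget
    return {
        "equivalent_compute": c_eq,
        "learning_rate": power_law(c_total, **lr_law),
        "batch_size": power_law(c_total, **bs_law),
    }


# Placeholder coefficients standing in for a stage 1 fit on proxy models.
LOSS_LAW = {"a": 100.0, "alpha": 0.1, "e": 1.7}
LR_LAW = {"coef": 1.0, "exponent": -0.1}
BS_LAW = {"coef": 1e-4, "exponent": 0.3}

if __name__ == "__main__":
    result = predict_hyperparameters(
        ckpt_loss=2.7, budget=5e19,
        loss_law=LOSS_LAW, lr_law=LR_LAW, bs_law=BS_LAW,
    )
    print(result)
```

In practice the coefficients would come from fitting the stage 1 proxy runs, and the predicted batch size would be rounded to a value the hardware supports.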

