Optimization Hyper-parameter Laws for Large Language Models
This work addresses hyper-parameter optimization for AI researchers and practitioners, offering a theoretical framework for training large models more efficiently. The contribution is incremental in that it builds on existing scaling laws.
The paper tackles the problem of selecting dynamic hyper-parameters, such as learning-rate schedules, for large language models, which are resource-intensive to train. It introduces Optimization Hyper-parameter Laws (Opt-Laws), which accurately predict training loss and identify candidate optimal schedules, reducing computational cost and improving performance in pre-training, continual training, and fine-tuning.
Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potentially optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.
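To make "dynamic hyper-parameter" concrete, the sketch below shows one widely used LR schedule, linear warmup followed by cosine decay. It is an illustration only and does not implement Opt-Laws. The function name `warmup_cosine_lr`, its parameters, and the numeric values are assumptions chosen for the example, not taken from the paper.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Return the learning rate at `step` (hypothetical helper, for illustration).

    The schedule ramps linearly from 0 to `peak_lr` over `warmup_steps`,
    then follows a cosine curve down to `min_lr` at `total_steps`.
    """
    if step < warmup_steps:
        # Warmup phase: linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fraction of the decay completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed example values: 1000 training steps, 100 warmup steps, peak LR 3e-4.
schedule = [warmup_cosine_lr(s, 1000, 3e-4, 100) for s in range(1001)]
```

Peak LR, warmup length, and decay floor are exactly the kind of knobs that change the learning rate over the course of training. A law that predicts training loss from such settings would let candidate schedules be compared before committing compute to a full run.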