LG, OC | Feb 5

Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers

arXiv:2602.05813v1 | 1 citation | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the cost of manually tuning warm-up for optimizers such as Muon and Lion and offers an automated alternative for machine learning practitioners. The advance is incremental because it builds on existing optimizer frameworks.

The paper targets the manual tuning of warm-up schedules for norm-constrained optimizers. It introduces a generalized smoothness assumption that ties local curvature to the suboptimality gap and derives convergence guarantees in which warm-up emerges naturally. Building on this theory, it develops an adaptive scheduler that sets the warm-up duration automatically. In large language model pretraining with LLaMA architectures, the scheduler matches or outperforms the best manually tuned warm-up schedules without extra hyperparameter search.

We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup
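For context, the kind of manually tuned baseline the abstract compares against can be sketched as a schedule with a fixed warm-up length followed by decay. This is a minimal illustration, not the paper's adaptive scheduler: the function name `warmup_cosine_lr`, its parameters, and the choice of linear warm-up with cosine decay are all assumptions made for this example. The point is only that `warmup_steps` is the hyperparameter a practitioner would otherwise have to search over.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical baseline: linear warm-up to peak_lr, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warm-up phase: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: cosine anneal from peak_lr down to min_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: a 1000-step run with a hand-picked 100-step warm-up (illustrative values).
schedule = [warmup_cosine_lr(s, 1000, 100, 3e-4) for s in range(1000)]
```

In this sketch, changing `warmup_steps` changes the whole early part of the schedule, which is why it is usually tuned per setup; the abstract's claim is that this duration can instead be chosen automatically.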

