LG · OC · ML · Oct 3, 2025

Why Do We Need Warm-up? A Theoretical Perspective

arXiv:2510.03164v1 · 6 citations · h-index: 44
Originality: Highly original
AI Analysis

The paper gives a principled explanation for a ubiquitous deep learning heuristic, addressing a foundational question for practitioners and researchers.

It addresses the lack of theoretical understanding of why learning rate warm-up improves training. Under a generalized smoothness condition, it proves that Gradient Descent with a warm-up schedule converges faster than with a fixed step-size, and it validates the result with experiments on language and vision models.

Learning rate warm-up, that is, increasing the learning rate at the beginning of training, has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
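To make the abstract's claim concrete, here is a minimal sketch on a toy problem, not the paper's setting or its analyzed schedule. It assumes the one-dimensional loss $f(x) = \cosh(x) - 1$, chosen for illustration because its curvature $f''(x) = \cosh(x) = 1 + (f(x) - f^*)$ is exactly a linear function of the loss sub-optimality. The linear warm-up schedule, the step-size values, and all function names are hypothetical choices made for this example.

```python
import math


def loss(x):
    # Toy loss with minimum f* = 0 at x = 0 (an assumption for illustration).
    return math.cosh(x) - 1.0


def grad(x):
    # Derivative of cosh(x) - 1.
    return math.sinh(x)


def constant_schedule(eta, total_steps):
    # Fixed step-size for every iteration.
    return [eta] * total_steps


def linear_warmup_schedule(eta_max, warmup_steps, total_steps):
    # Step-size grows linearly to eta_max over warmup_steps, then stays flat.
    return [eta_max * min(1.0, (t + 1) / warmup_steps) for t in range(total_steps)]


def gradient_descent(x0, step_sizes, blowup=50.0):
    # Plain Gradient Descent; returns the final loss, or inf if iterates blow up.
    x = x0
    for eta in step_sizes:
        x = x - eta * grad(x)
        if abs(x) > blowup:
            return math.inf
    return loss(x)


x0, steps = 5.0, 200

# Large fixed step: fine near the optimum (curvature 1), too large at x0.
large = gradient_descent(x0, constant_schedule(1.0, steps))
# Small fixed step: safe at x0 (curvature about 74), but slow near the optimum.
small = gradient_descent(x0, constant_schedule(0.02, steps))
# Warm-up: starts at the small step and ramps up to the large one.
warm = gradient_descent(x0, linear_warmup_schedule(1.0, 50, steps))

print("fixed step 1.0 :", large)
print("fixed step 0.02:", small)
print("linear warm-up :", warm)
```

In this sketch the curvature is high far from the optimum and low near it, so a single fixed step-size must be small enough for the starting point and is then needlessly conservative later. The warm-up schedule raises the step-size as the loss, and with it the curvature, decreases.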

