Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence
The work provides theoretical insight into a widely used but poorly understood technique in deep learning, though the contribution is incremental because it builds on existing optimization theory.
The paper addresses the lack of theoretical understanding of learning rate warmup in deep learning by proposing a novel family of generalized smoothness assumptions and analyzing the convergence of gradient descent under them. It shows that, in some specific cases, gradient descent with warmup can converge up to Θ(T) times faster than with a non-increasing learning rate schedule.
Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite its huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the beginning of the training process have not been fully understood. To resolve this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions, and validate its applicability both theoretically and empirically. Under the novel smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. It is shown that learning rate warmup consistently accelerates GD, and GD with warmup can converge up to $\Theta(T)$ times faster than with a non-increasing learning rate schedule in some specific cases, providing insights into the benefits of this strategy from an optimization theory perspective.
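The intuition behind warmup can be illustrated with a minimal Python sketch. It is not the paper's setting or experiment. The toy objective f(x) = x^4, the starting point, the step counts, and the function names (`constant_schedule`, `linear_warmup_schedule`, `gradient_descent`) are all assumptions chosen for illustration. The objective is picked because its curvature, 12x^2, shrinks as the iterate approaches the minimizer, so a step size that is safe at the start becomes overly conservative later. The sketch compares plain GD with a constant learning rate against GD whose learning rate starts at the same value and increases linearly.

```python
def loss(x):
    # Toy objective (an assumption for illustration): f(x) = x^4.
    return x ** 4


def grad(x):
    # Derivative of x^4.
    return 4.0 * x ** 3


def constant_schedule(step, base_lr=0.1):
    # Non-increasing baseline: the same learning rate at every step.
    return base_lr


def linear_warmup_schedule(step, base_lr=0.1, increment=0.1):
    # Warmup: start at base_lr and grow linearly with the step index.
    return base_lr + increment * step


def gradient_descent(schedule, x0=1.0, num_steps=50):
    # Deterministic GD: x <- x - lr(step) * f'(x).
    x = x0
    for step in range(num_steps):
        x -= schedule(step) * grad(x)
    return x


final_constant = loss(gradient_descent(constant_schedule))
final_warmup = loss(gradient_descent(linear_warmup_schedule))
print(f"constant lr: final loss = {final_constant:.3e}")
print(f"warmup lr:   final loss = {final_warmup:.3e}")
```

In this toy run the warmup schedule ends at a much lower loss than the constant schedule with the same starting rate. The comparison is deliberately narrow: the constant rate is fixed to the warmup's initial value, and a one-dimensional example says nothing about the Θ(T) separation stated in the abstract, which holds under the paper's own assumptions.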