Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
This work addresses adaptive learning rate tuning, a problem of direct practical interest to deep learning practitioners. Its contribution is incremental: it modifies an existing method rather than introducing a new paradigm.
The paper revisits a cumulative path-based learning rate adaptation scheme for SGD and Adam. It shows that the original scheme is conceptually inconsistent for Adam because of the optimizer's internal preconditioning, and it proposes a corrected variant. It then benchmarks SGD and Adam, with and without cumulative adaptation, against a recent alternative method to clarify the practical benefits. The results indicate that the corrected variant improves Adam's performance in certain scenarios, though the abstract does not give specific numerical gains.
The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer's internal preconditioning. We propose a corrected variant that better reflects Adam's update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.
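The path-length comparison described above can be illustrated with a minimal sketch for plain SGD. This is not the paper's algorithm: the function names `cumulative_lr_step` and `run_sequence`, the discount factor `c`, the damping `d`, and the exponential update rule are all assumptions, loosely modeled on cumulative step-size adaptation from evolution strategies. The path is a time-discounted sum of normalized steps, scaled so that its expected squared length stays at 1 when successive directions are uncorrelated, as in a random walk. A longer path suggests consistent directions and raises the learning rate; a shorter path suggests oscillation and lowers it.

```python
import numpy as np

def cumulative_lr_step(path, lr, grad, c=0.1, d=1.0):
    """One illustrative update of the discounted path and the learning rate.

    c (discount) and d (damping) are hypothetical defaults, not values
    taken from the paper.
    """
    g_norm = np.linalg.norm(grad)
    # Normalized SGD step direction (negative gradient); zero if the gradient vanishes.
    unit = -grad / g_norm if g_norm > 0 else np.zeros_like(grad)
    # Time-discounted sum of normalized steps. The factor sqrt(c * (2 - c))
    # keeps E[||path||^2] = 1 at stationarity for uncorrelated unit directions.
    path = (1.0 - c) * path + np.sqrt(c * (2.0 - c)) * unit
    # Assumed update rule: grow lr if the path is longer than the
    # random-walk reference (squared length 1), shrink it if shorter.
    lr = lr * np.exp((c / (2.0 * d)) * (path @ path - 1.0))
    return path, lr

def run_sequence(grads, lr=0.01, c=0.1, d=1.0):
    """Apply the update over a sequence of gradients and return the final lr."""
    path = np.zeros_like(grads[0], dtype=float)
    for g in grads:
        path, lr = cumulative_lr_step(path, lr, g, c, d)
    return lr

g = np.array([1.0, 2.0, -0.5])
lr_consistent = run_sequence([g] * 50)  # same direction every step
lr_oscillating = run_sequence([g if i % 2 == 0 else -g for i in range(50)])
```

In this sketch, repeated gradients in the same direction lengthen the path and increase the learning rate, while sign-flipping gradients keep the path short and decrease it. Because the path starts at zero, the first few updates shrink the learning rate slightly even when directions are consistent. Applying the same rule to Adam would require deciding how the preconditioned update is normalized, which is the issue the abstract raises and which this sketch does not attempt to resolve.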