LGAug 31, 2021
Using a one dimensional parabolic model of the full-batch loss to estimate learning rates during trainingMaximus Mutschler, Kevin Laube, Andreas Zell
A fundamental challenge in Deep Learning is to find optimal step sizes for stochastic gradient descent automatically. In traditional optimization, line searches are a commonly used method to determine step sizes. One problem in Deep Learning is that finding appropriate step sizes on the full-batch loss is unfeasibly expensive. Therefore, classical line search approaches, designed for losses without inherent noise, are usually not applicable. Recent empirical findings suggest, inter alia, that the full-batch loss behaves locally parabolically in the direction of noisy update step directions. Furthermore, the trend of the optimal update step size changes slowly. By exploiting these and more findings, this work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches. Learning rates are derived from such parabolas during training. In the experiments conducted, our approach is on par with SGD with Momentum tuned with a piece-wise constant learning rate schedule and often outperforms other line search approaches for Deep Learning across models, datasets, and batch sizes on validation and test accuracy. In addition, our approach is the first line search approach for Deep Learning that samples a larger batch size over multiple inferences to still work in low-batch scenarios.
LGMar 31, 2021
Empirically explaining SGD from a line search perspectiveMaximus Mutschler, Andreas Zell
Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with a limited understanding how and why these work in practice. To shed more light on this, our work provides some deeper understandings of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, a costly quantitative analysis of the full-batch loss along SGD trajectories from common used models trained on a subset of CIFAR-10 is performed. Our core results include that the full-batch loss along lines in update step direction is highly parabolically. Further on, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
LGOct 2, 2020
A straightforward line search approach on the expected empirical loss for stochastic deep learning problemsMaximus Mutschler, Andreas Zell
A fundamental challenge in deep learning is that the optimal step sizes for update steps of stochastic gradient descent are unknown. In traditional optimization, line searches are used to determine good step sizes, however, in deep learning, it is too costly to search for good step sizes on the expected empirical loss due to noisy losses. This empirical work shows that it is possible to approximate the expected empirical loss on vertical cross sections for common deep learning tasks considerably cheaply. This is achieved by applying traditional one-dimensional function fitting to measured noisy losses of such cross sections. The step to a minimum of the resulting approximation is then used as step size for the optimization. This approach leads to a robust and straightforward optimization method which performs well across datasets and architectures without the need of hyperparameter tuning.
LGMar 28, 2019
Parabolic Approximation Line Search for DNNsMaximus Mutschler, Andreas Zell
A major challenge in current optimization research for deep learning is to automatically find optimal step sizes for each update step. The optimal step size is closely related to the shape of the loss in the update step direction. However, this shape has not yet been examined in detail. This work shows empirically that the batch loss over lines in negative gradient direction is mostly convex locally and well suited for one-dimensional parabolic approximations. By exploiting this parabolic property we introduce a simple and robust line search approach, which performs loss-shape dependent update steps. Our approach combines well-known methods such as parabolic approximation, line search and conjugate gradient, to perform efficiently. It surpasses other step size estimating methods and competes with common optimization methods on a large variety of experiments without the need of hand-designed step size schedules. Thus, it is of interest for objectives where step-size schedules are unknown or do not perform well. Our extensive evaluation includes multiple comprehensive hyperparameter grid searches on several datasets and architectures. Finally, we provide a general investigation of exact line searches in the context of batch losses and exact losses, including their relation to our line search approach.