ADLER -- An efficient Hessian-based strategy for adaptive learning rate
This work addresses adaptive learning-rate tuning in deep learning and offers practitioners an efficient alternative to grid search; its contribution is incremental in that it builds on existing Hessian-based methods.
The paper addresses the efficient computation of adaptive learning rates for deep learning. It derives a positive semi-definite approximation of the Hessian for which Hessian-vector products are easily computable. The resulting strategy performs comparably to grid search over SGD learning rates at only twice the computational cost of a single SGD run.
We derive a sound positive semi-definite approximation of the Hessian of deep models for which Hessian-vector products are easily computable. This enables us to provide an adaptive SGD learning rate strategy based on the minimization of the local quadratic approximation, which requires just twice the computation of a single SGD run, but performs comparably with grid search on SGD learning rates on different model architectures (CNN with and without residual connections) on classification tasks. We also compare the novel approximation with the Gauss-Newton approximation.
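To make the idea of a learning rate obtained by minimizing the local quadratic approximation concrete, here is a minimal sketch under stated assumptions. For a step along the negative gradient g, the quadratic model L(w) - lr * g.g + 0.5 * lr^2 * g.Hg is minimized at lr = g.g / g.Hg whenever the curvature term g.Hg is positive. The toy quadratic loss, the helper names (grad, hvp, adaptive_lr, train), the fallback rate, and the finite-difference Hessian-vector product are all illustrative assumptions; in particular, the finite-difference product is only a stand-in for the paper's positive semi-definite approximation, which is not specified here, and it does not reproduce the paper's stated cost.

```python
# Hypothetical toy problem: L(w) = 0.5 * w^T A w - b^T w, with A positive definite.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def loss(w):
    Aw = [dot(row, w) for row in A]
    return 0.5 * dot(w, Aw) - dot(b, w)


def grad(w):
    # Gradient of the toy loss: A w - b.
    return [dot(row, w) - bi for row, bi in zip(A, b)]


def hvp(w, v, eps=1e-4):
    # Central finite-difference Hessian-vector product.
    # Assumption: used here in place of the paper's PSD Hessian approximation.
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(grad(w_plus), grad(w_minus))]


def adaptive_lr(w, fallback=0.1, tiny=1e-12):
    # Minimizer of the local quadratic model along -g: lr = g.g / g.Hg.
    g = grad(w)
    gg = dot(g, g)
    curvature = dot(g, hvp(w, g))
    if curvature <= tiny * gg:
        # No usable positive curvature: fall back to a fixed (hypothetical) rate.
        return fallback
    return gg / curvature


def train(w, steps=20):
    # Plain gradient descent where each step picks its own learning rate.
    for _ in range(steps):
        lr = adaptive_lr(w)
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w
```

Calling train([0.0, 0.0]) drives the iterate toward the minimizer of the toy loss without any hand-tuned learning rate. With a stochastic mini-batch gradient and a positive semi-definite curvature approximation, the curvature term cannot be negative, which is one reason such an approximation is convenient for this kind of step-size rule.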