Random Function Descent
This work addresses the lack of theoretical grounding for the optimization methods used in machine learning practice, offering a framework that bridges Bayesian and classical optimization.
The paper tackles two limitations of classical worst-case optimization theory: it does not explain why optimization succeeds in machine learning, and it gives no guidance on step size selection. Replacing it with a 'random function' framework leads to Random Function Descent (RFD), a scale-invariant method with $\mathcal{O}(nd)$ complexity that provides theoretical justification for heuristics like gradient clipping and learning rate warmup.
Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has so far not been viable in high dimensions. By bridging the gap between Bayesian optimization (i.e., random function optimization theory) and classical optimization, we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which scales to high dimensions due to its $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantages of the random function framework are that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.
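To make the step size schedule more concrete, the following is a minimal sketch under assumptions that the abstract does not state: the loss is modeled as an isotropic Gaussian random function with constant mean `mean` and squared-exponential covariance with length scale `length_scale`. Under that assumed model, standard Gaussian conditioning gives the expected loss after a step of length eta along the normalized negative gradient as mean + exp(-eta^2 / (2 s^2)) * ((loss - mean) - eta * grad_norm), and minimizing over eta yields a closed form. The function names and parameters are hypothetical and chosen for illustration only; this is not necessarily the exact form used in the paper.

```python
import math


def surrogate(eta, loss, grad_norm, mean, length_scale):
    """Expected loss after a step of length eta along -grad / ||grad||.

    Assumes (hypothetically) a squared-exponential covariance; this is the
    conditional expectation given the current loss and gradient only.
    """
    decay = math.exp(-eta ** 2 / (2.0 * length_scale ** 2))
    return mean + decay * ((loss - mean) - eta * grad_norm)


def rfd_step_length(loss, grad_norm, mean, length_scale):
    """Minimizer of `surrogate` over eta >= 0 (closed form for this model).

    Setting the derivative to zero gives the quadratic
        grad_norm * eta^2 + (mean - loss) * eta - length_scale^2 * grad_norm = 0,
    whose positive root is returned in a cancellation-free form.
    """
    if grad_norm == 0.0:
        return 0.0  # no descent direction is available
    gap = mean - loss  # how far the current loss lies below the prior mean
    root = math.sqrt(gap ** 2 + 4.0 * length_scale ** 2 * grad_norm ** 2)
    return 2.0 * length_scale ** 2 * grad_norm / (gap + root)


def rfd_update(weights, grad, loss, mean, length_scale):
    """One descent step: move `rfd_step_length` along the normalized gradient."""
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if grad_norm == 0.0:
        return list(weights)
    eta = rfd_step_length(loss, grad_norm, mean, length_scale)
    return [w - eta * g / grad_norm for w, g in zip(weights, grad)]
```

Two properties named in the abstract are visible in this sketch. Multiplying `loss`, `mean`, and `grad_norm` by the same constant leaves the step length unchanged, which is the scale invariance. For very large gradients the step length approaches `length_scale` and never exceeds it when the loss is at or below the mean, which resembles gradient clipping. Each update costs O(d), consistent with the $\mathcal{O}(nd)$ total over $n$ steps.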