Acceleration by Random Stepsizes: Hedging, Equalization, and the Arcsine Stepsize Schedule
Provides a simple, theory-driven method to accelerate gradient descent for convex optimization, potentially impacting optimization in machine learning.
Random stepsizes from the Arcsine distribution accelerate Gradient Descent for separable convex optimization, improving the convergence rate from O(k) to O(√k) without momentum or other modifications.
We show that for separable convex optimization, random stepsizes fully accelerate Gradient Descent. Specifically, using inverse stepsizes i.i.d. from the Arcsine distribution improves the convergence rate from $O(k)$ to $O(\sqrt{k})$, where $k$ is the condition number. No momentum or other algorithmic modifications are required. Our starting point is a remarkable "equalization property" of the Arcsine distribution: it yields an identical convergence rate for all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. We interpret this equalization as an extreme form of hedging: by using this random distribution over stepsizes, Gradient Descent converges at exactly the same rate for all functions in the function class.