Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness
This work addresses a foundational challenge in optimization theory by providing adaptive algorithms for a broad class of functions, with potential impact across machine learning and AI. The contribution is incremental, however, in that it extends existing online-to-batch techniques.
The paper studies the optimization of Hölder smooth functions, a class that covers both smooth and non-smooth cases. It develops adaptive online learning algorithms that achieve optimal regret without prior knowledge of the smoothness parameter, and it extends them to a universal offline method with accelerated convergence in the smooth regime and near-optimal convergence in the non-smooth one.
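For reference, the standard definition of Hölder smoothness can be written as follows. The symbols $\nu$, $L_\nu$, $G$, and $\mathcal{X}$ are notation assumed here for illustration; the paper's own notation may differ.

```latex
% Standard definition (assumed notation): f is (L_\nu, \nu)-Hölder smooth on \mathcal{X} if
\[
  \|\nabla f(x) - \nabla f(y)\| \le L_\nu \, \|x - y\|^{\nu}
  \qquad \text{for all } x, y \in \mathcal{X}, \quad \nu \in [0, 1].
\]
% \nu = 1: the usual smoothness condition (Lipschitz-continuous gradient).
% \nu = 0: bounded gradient variation, which holds with L_0 = 2G
%          whenever the (sub)gradients are bounded in norm by G.
```

The two endpoints recover the smooth and the non-smooth (Lipschitz) regimes, and intermediate values of $\nu$ interpolate between them.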
Smoothness is known to be crucial for acceleration in offline optimization and for gradient-variation regret minimization in online learning. Interestingly, these two problems are closely connected: accelerated optimization can be understood through the lens of gradient-variation online learning. In this paper, we investigate online learning with Hölder smooth functions, a general class encompassing both smooth and non-smooth (Lipschitz) functions, and explore its implications for offline optimization. For (strongly) convex online functions, we design corresponding gradient-variation online learning algorithms whose regret smoothly interpolates between the optimal guarantees in the smooth and non-smooth regimes. Notably, our algorithms do not require prior knowledge of the Hölder smoothness parameter, exhibiting stronger adaptivity than existing methods. Through online-to-batch conversion, this gradient-variation online adaptivity yields an optimal universal method for stochastic convex optimization under Hölder smoothness. However, achieving universality in offline strongly convex optimization is more challenging. We address this by integrating online adaptivity with a detection-based guess-and-check procedure, which, for the first time, yields a universal offline method that achieves accelerated convergence in the smooth regime while maintaining near-optimal convergence in the non-smooth one.
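To make the online-to-batch idea concrete, here is a minimal sketch of the classical conversion: run an online learner on the gradients of a fixed convex objective and return the average of the points it played. This is not the paper's algorithm. The learner is plain projected online (sub)gradient descent with an AdaGrad-norm style step size, chosen only because it needs no smoothness parameter, and the names `online_to_batch`, `abs_grad`, and `quad_grad` are hypothetical.

```python
import math


def online_to_batch(grad, x0, radius, steps):
    """Classical online-to-batch conversion (illustrative sketch, not the paper's method).

    Runs projected online (sub)gradient descent on the ball of the given radius,
    with step size radius / sqrt(sum of squared gradient norms so far), and
    returns the average of the played points.
    """
    x = list(x0)
    avg = [0.0] * len(x)
    sq_sum = 0.0
    for t in range(1, steps + 1):
        # Fold the current played point into the running average.
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]
        g = grad(x)
        sq_sum += sum(gi * gi for gi in g)
        if sq_sum > 0.0:
            # Step size depends only on observed gradients, not on any smoothness constant.
            eta = radius / math.sqrt(sq_sum)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
            # Project back onto the Euclidean ball.
            norm = math.sqrt(sum(xi * xi for xi in x))
            if norm > radius:
                x = [xi * radius / norm for xi in x]
    return avg


def abs_grad(x):
    """Subgradient of the non-smooth objective f(x) = |x - 1|."""
    d = x[0] - 1.0
    return [0.0 if d == 0.0 else math.copysign(1.0, d)]


def quad_grad(x):
    """Gradient of the smooth objective f(x) = (x - 1)^2."""
    return [2.0 * (x[0] - 1.0)]


# The same routine is applied to both objectives without telling it which is smooth.
avg_nonsmooth = online_to_batch(abs_grad, [0.0], radius=2.0, steps=2000)
avg_smooth = online_to_batch(quad_grad, [0.0], radius=2.0, steps=2000)
```

Both runs share one code path, which is the sense in which such a method is called universal. The sketch does not reproduce the accelerated rates or the guess-and-check procedure described in the abstract.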