LG, Apr 2, 2025

Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Robert M. Gower, Guillaume Garrigos, Nicolas Loizou, Dimitris Oikonomou, Konstantin Mishchenko, Fabian Schaipp

arXiv:2504.01898v1, 6 citations, h-index: 21

Originality: Incremental advance

AI Analysis

This work analyzes a theoretically grounded step size rule for stochastic optimization in machine learning, offering incremental improvements in convergence analysis and a practical application to model distillation.

The paper introduces an idealized stochastic Polyak step size (SPS$^*$) and proves convergence for convex losses under a local expected gradient bound. The method matches the optimal lower bound for globally Lipschitz functions, attains $O(1/\sqrt{t})$ anytime convergence in the smooth setting, and is used to distill a GPT-2 teacher model into a smaller student model without hyperparameter tuning.

We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and with a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
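To make the "idealized" requirement concrete, here is a minimal sketch of a Polyak-type update that uses the per-batch loss at a solution. The abstract does not spell out the update, so the form below, $x_{t+1} = x_t - \frac{(f_i(x_t) - f_i(x^*))_+}{\|g_t\|^2} g_t$, is an assumption based on the usual Polyak step with a positive part. The toy problem (losses $f_i(x) = \tfrac12\|x - c_i\|^2$, whose minimizer is the mean of the $c_i$) and the names `sps_star_step` and `run_toy` are hypothetical and chosen only because $f_i(x^*)$ is available in closed form there.

```python
import math
import random


def sps_star_step(x, loss_value, grad, loss_at_solution):
    """One assumed SPS*-style step: x - (f_i(x) - f_i(x*))_+ / ||g||^2 * g."""
    g_sq = sum(g * g for g in grad)
    if g_sq == 0.0:
        # Zero gradient: nothing to do for this sample.
        return list(x)
    # Positive part keeps the step size non-negative.
    step = max(loss_value - loss_at_solution, 0.0) / g_sq
    return [xj - step * gj for xj, gj in zip(x, grad)]


def run_toy(num_samples=20, dim=3, num_steps=200, seed=0):
    """Hypothetical toy problem: f_i(x) = 0.5 * ||x - c_i||^2."""
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(num_samples)]
    # The average loss is minimized at the mean of the centers.
    x_star = [sum(c[j] for c in centers) / num_samples for j in range(dim)]

    def loss(x, c):
        return 0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c))

    x = [5.0] * dim
    distances = [math.dist(x, x_star)]
    for _ in range(num_steps):
        c = rng.choice(centers)                      # sample one "batch"
        grad = [xj - cj for xj, cj in zip(x, c)]     # gradient of f_i at x
        # The idealized part: the sampled loss evaluated at the solution.
        x = sps_star_step(x, loss(x, c), grad, loss(x_star, c))
        distances.append(math.dist(x, x_star))
    return x, x_star, distances


if __name__ == "__main__":
    _, _, distances = run_toy()
    print(f"initial distance: {distances[0]:.4f}")
    print(f"final distance:   {distances[-1]:.4f}")
```

No learning rate appears anywhere in the sketch, since the step size is determined by the sampled loss gap and gradient norm. For convex $f_i$, this update never increases the distance to $x^*$, which can be checked on the returned `distances` list.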
