Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
This work provides a theoretical foundation for understanding optimization algorithms under more realistic gradient growth conditions, benefiting the machine learning community by guiding algorithm selection for non-smooth problems.
The paper introduces a class of generalized Lipschitz functions where gradient norms are bounded by an affine function of the optimality gap, and shows that AdamW with clipped updates achieves the best global convergence rates for convex stochastic optimization in this setting, outperforming SGD and AdaGrad.
Much of the existing theory on first-order non-smooth optimization is built on the restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: which algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
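The generalized Lipschitz condition described in words above can be sketched as an inequality. This is one possible formalization, not necessarily the exact definition used in the paper: the constants $L_0, L_1 \ge 0$, the optimal value $f^*$, and the subdifferential notation $\partial f(x)$ are assumptions introduced here for illustration, as is the one-dimensional example.

```latex
% Assumed notation: f^* = \min_x f(x), and g is any subgradient of f at x.
\[
  \|g\| \;\le\; L_0 + L_1 \bigl( f(x) - f^* \bigr)
  \qquad \text{for all } x \text{ and all } g \in \partial f(x).
\]

% Setting L_1 = 0 recovers the classical uniformly bounded (Lipschitz) case:
\[
  \|g\| \;\le\; L_0 .
\]

% Illustrative example (not from the source): f(x) = e^{|x|} on \mathbb{R}.
% It is convex, non-smooth at x = 0, and has f^* = f(0) = 1.
% For x \neq 0 the only subgradient is g = \operatorname{sign}(x)\, e^{|x|}, so
\[
  |g| \;=\; e^{|x|} \;=\; 1 + \bigl( f(x) - f^* \bigr),
\]
% and at x = 0 every subgradient satisfies |g| \le 1.
% Hence the condition holds with L_0 = 1 and L_1 = 1,
% although no uniform bound on |g| exists.
```

Under this reading, the class strictly contains the uniformly bounded case: the example has subgradients that grow without bound as $|x| \to \infty$, yet they stay controlled by the optimality gap.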