A Stochastic Proximal Polyak Step Size
This work addresses tuning and stability issues in stochastic optimization for regularized problems, an incremental but practically important contribution for machine learning practitioners.
The authors adapt the stochastic Polyak step size (SPS) to regularized problems by developing ProxSPS, which requires a lower bound only for the loss rather than for the entire objective, making it easier to tune and more stable. For image classification, ProxSPS performs as well as AdamW with minimal tuning and yields smaller weight parameters.
Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss, which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that covers the non-smooth, smooth, weakly convex, and strongly convex settings.
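To make the contrast between the two step size rules concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the update formulas, so everything here is an assumption: the regularizer is taken to be the squared l2 norm (lam/2)*||x||^2, the loss is replaced by its linearization truncated at the loss lower bound, and the names sps_step and prox_sps_l2_step are hypothetical. Under those assumptions the proximal step has a closed form, obtained by minimizing the truncated model plus the regularizer plus a proximity term.

```python
def sps_step(x, value, grad, lower_bound, alpha_max):
    """Capped Polyak step (sketch).

    If applied to a regularized objective, `value`, `grad` and `lower_bound`
    must all refer to the full objective (loss + regularizer).
    """
    g2 = sum(g * g for g in grad)
    if g2 == 0.0:
        return list(x)
    # Polyak step size, capped at alpha_max.
    tau = min(alpha_max, max(value - lower_bound, 0.0) / g2)
    return [xi - tau * gi for xi, gi in zip(x, grad)]


def prox_sps_l2_step(x, loss, grad, loss_lower_bound, alpha, lam):
    """Proximal Polyak-type step for loss + (lam/2)*||x||^2 (sketch, assumed form).

    Closed-form minimizer over y of
        max(loss + <grad, y - x>, loss_lower_bound)
        + (lam / 2) * ||y||^2 + ||y - x||^2 / (2 * alpha).
    Only the loss value, the loss gradient and a lower bound on the loss
    are needed; the regularizer is handled exactly.
    """
    shrink = 1.0 / (1.0 + alpha * lam)
    g2 = sum(g * g for g in grad)
    if g2 == 0.0:
        return [shrink * xi for xi in x]
    gx = sum(gi * xi for gi, xi in zip(grad, x))
    tau = ((1.0 + alpha * lam) * (loss - loss_lower_bound) - alpha * lam * gx) / g2
    tau = min(alpha, max(tau, 0.0))  # clip to [0, alpha]
    return [shrink * (xi - tau * gi) for xi, gi in zip(x, grad)]


# Toy example: loss(x) = 0.5 * ||x||^2, so grad = x and 0 is a lower bound.
x0 = [1.0, -2.0]
loss0 = 0.5 * sum(v * v for v in x0)
no_reg = prox_sps_l2_step(x0, loss0, x0, 0.0, alpha=1.0, lam=0.0)
with_reg = prox_sps_l2_step(x0, loss0, x0, 0.0, alpha=1.0, lam=1.0)
```

With lam = 0 the proximal sketch reduces to the capped Polyak step, which is a useful sanity check. With lam > 0 the iterate is additionally shrunk by the factor 1/(1 + alpha*lam), and no lower bound on the sum of loss and regularizer is ever required.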