Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling
This paper addresses a convergence issue in probabilistic modeling for large datasets and offers a method that improves prediction accuracy in generalized linear models. The contribution is incremental, as it builds on existing stochastic gradient techniques.
The paper tackles the non-convergence of constant-step-size stochastic gradient descent for probabilistic models by proposing to average moment parameters instead of natural parameters. It shows that this approach can lead to better predictions than the best linear model in finite-dimensional cases and always converges to optimal predictions in infinite-dimensional models. The findings are illustrated with simulations on synthetic and benchmark data.
Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have been shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, they do not lead to convergent algorithms in general. In this paper, we consider generalized linear models, that is, conditional models based on exponential families. We propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent. For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model. For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does. We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.
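To make the distinction between the two averaging schemes concrete, the following is a minimal sketch for logistic regression, one instance of a generalized linear model. It is not the authors' implementation: the function names (`sigmoid`, `dot`, `constant_step_sgd`), the synthetic data, and the step size of 0.5 are all assumptions chosen for illustration. Averaging natural parameters predicts with the sigmoid of the averaged iterate, whereas averaging moment parameters averages the predicted probabilities produced by each iterate.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def constant_step_sgd(X, y, step, x_test):
    """One pass of constant-step-size SGD on the logistic loss.

    Returns two lists of predicted P(y = 1 | x), one entry per test point:
      - natural-parameter averaging: sigmoid(x . mean_n(theta_n))
      - moment-parameter averaging:  mean_n(sigmoid(x . theta_n))
    """
    d = len(X[0])
    theta = [0.0] * d              # current iterate theta_n
    theta_bar = [0.0] * d          # running mean of the iterates
    mu_bar = [0.0] * len(x_test)   # running mean of predicted probabilities
    for n, (x, label) in enumerate(zip(X, y), start=1):
        # Stochastic gradient of the logistic loss at a single observation.
        residual = sigmoid(dot(x, theta)) - label
        theta = [t - step * residual * xj for t, xj in zip(theta, x)]
        # Incremental updates of both running means.
        theta_bar = [tb + (t - tb) / n for tb, t in zip(theta_bar, theta)]
        mu_bar = [mb + (sigmoid(dot(xt, theta)) - mb) / n
                  for mb, xt in zip(mu_bar, x_test)]
    pred_natural = [sigmoid(dot(xt, theta_bar)) for xt in x_test]
    return pred_natural, mu_bar


# Hypothetical synthetic data: an intercept plus one Gaussian feature,
# with labels drawn from a logistic model of slope 2 (assumed values).
random.seed(0)
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(2000)]
y = [1.0 if random.random() < sigmoid(2.0 * x[1]) else 0.0 for x in X]
x_test = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]

pred_natural, pred_moment = constant_step_sgd(X, y, step=0.5, x_test=x_test)
print("averaged natural parameters:", pred_natural)
print("averaged moment parameters: ", pred_moment)
```

Note that the moment-parameter average must be tracked per test point during training (or the iterates must be stored), because it is no longer a function of a single averaged parameter vector. In this sketch the test points are fixed in advance to keep the bookkeeping simple.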