The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights
The work provides a unified framework for understanding noise in optimization, which is relevant for machine learning practitioners, but the contribution is incremental because it builds on existing theory for gradient descent and regularization.
The paper analyzes gradient descent with random data weights in linear regression, characterizing the implicit regularization and deriving non-asymptotic convergence bounds, showing that fast-converging weightings can lead to poor statistical performance.
We analyze gradient descent with randomly weighted data points in a linear regression model, under a generic weighting distribution. This setting includes various forms of stochastic gradient descent and importance sampling, and also extends to weighting distributions with arbitrary continuous values, thereby providing a unified framework for analyzing the impact of various kinds of noise on the training trajectory. We characterize the implicit regularization induced by the random weighting, connect it with weighted linear regression, and derive non-asymptotic bounds for convergence in first and second moments. Leveraging geometric moment contraction, we also investigate the stationary distribution induced by the added noise. Based on these results, we discuss how specific choices of weighting distribution influence both the underlying optimization problem and the statistical properties of the resulting estimator, and we give examples in which weightings that lead to fast convergence cause poor statistical performance.
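To make the setting concrete, the following is a minimal NumPy sketch of gradient descent in which every data point's contribution to the gradient is rescaled by a fresh random weight at each step. The exact form of the update, the function name `weighted_gd_path`, the two weighting distributions, the step size, and the synthetic data are illustrative assumptions, not the paper's definitions. Under this assumed update, the expected iterate follows plain gradient descent on a least-squares problem weighted by the mean weights, so iterates driven by i.i.d. weights of mean one should fluctuate around the ordinary least-squares solution.

```python
import numpy as np


def weighted_gd_path(X, y, draw_weights, step_size, n_steps, rng):
    """Run gradient descent with fresh random data weights at every step.

    Assumed update (illustrative, not taken from the paper):
        w <- w + step_size * X^T diag(weights_k) (y - X w) / n
    """
    n, d = X.shape
    w = np.zeros(d)
    path = np.empty((n_steps, d))
    for k in range(n_steps):
        weights = draw_weights(rng, n)              # one random weight per data point
        grad = -X.T @ (weights * (y - X @ w)) / n   # randomly weighted squared-loss gradient
        w = w - step_size * grad
        path[k] = w
    return path


def bernoulli_weights(rng, n):
    # Each point is dropped (weight 0) or doubled (weight 2); mean weight is 1.
    return 2.0 * rng.integers(0, 2, size=n)


def uniform_weights(rng, n):
    # Continuous weights on [0, 2]; mean weight is 1.
    return rng.uniform(0.0, 2.0, size=n)


# Synthetic regression data (hypothetical values chosen for illustration).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.5 * rng.normal(size=n)

# Reference estimator: ordinary least squares.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

path_bernoulli = weighted_gd_path(X, y, bernoulli_weights, 0.05, 6000, rng)
path_uniform = weighted_gd_path(X, y, uniform_weights, 0.05, 6000, rng)

# Discard the first 1000 steps as burn-in and summarize the remaining iterates.
tail_mean_bernoulli = path_bernoulli[1000:].mean(axis=0)
tail_mean_uniform = path_uniform[1000:].mean(axis=0)
tail_std_bernoulli = path_bernoulli[1000:].std(axis=0)
tail_std_uniform = path_uniform[1000:].std(axis=0)

print("OLS solution:          ", w_ols)
print("Bernoulli tail mean:   ", tail_mean_bernoulli)
print("Uniform tail mean:     ", tail_mean_uniform)
print("Bernoulli tail spread: ", tail_std_bernoulli)
print("Uniform tail spread:   ", tail_std_uniform)
```

Because the residuals do not vanish at the least-squares solution, the iterates do not settle on a single point but keep fluctuating. Comparing the two spreads gives a rough empirical view of how the choice of weighting distribution shapes the long-run behaviour of the iterates, under the assumptions stated above.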