OC · ML · Feb 12, 2015

Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

arXiv:1502.03571v5 · 43 citations
AI Analysis

This work addresses large-scale linear regression problems in machine learning and data analysis by bridging stochastic gradient descent and randomized linear algebra, offering incremental improvements in computational efficiency for specific regression types.

The paper tackles constrained overdetermined linear regression problems, such as ℓ₂ and ℓ₁ regression, by proposing a hybrid algorithm (pwSGD) that combines randomized linear algebra for preconditioning with weighted stochastic gradient descent, achieving faster convergence rates dependent only on the lower dimension and improved time complexities, e.g., O(log n·nnz(A) + poly(d)/ε²) for ℓ₁ regression with ε relative error.

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems---e.g., $\ell_2$ and $\ell_1$ regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Particularly, when solving $\ell_1$ regression with size $n$ by $d$, pwSGD returns an approximate solution with $ε$ relative error in the objective value in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d)/ε^2)$ time. This complexity is uniformly better than that of RLA methods in terms of both $ε$ and $d$ when the problem is unconstrained. For $\ell_2$ regression, pwSGD returns an approximate solution with $ε$ relative error in the objective value and the solution vector measured in prediction norm in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d) \log(1/ε) /ε)$ time. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that still new ideas will be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets.
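The three stages described in the abstract (precondition with RLA, build an importance sampling distribution, run weighted SGD on the preconditioned system) can be illustrated with a minimal sketch for the unconstrained $\ell_2$ case. This is not the authors' implementation. The function name `pwsgd_l2`, the dense Gaussian sketch, the sketch size of `10 * d` rows, and the step-size rule are all assumptions made for illustration. In particular, a dense Gaussian sketch costs more than the input-sparsity time quoted above, and with `step=1.0` the update below coincides with a randomized Kaczmarz step on the preconditioned system.

```python
import numpy as np


def pwsgd_l2(A, b, n_iters=2000, sketch_rows=None, step=1.0, seed=0):
    """Illustrative preconditioned weighted SGD for min_x ||Ax - b||_2.

    Hypothetical helper: the sketch type, sketch size, and step-size rule
    are assumptions for this example, not the choices made in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = sketch_rows or 10 * d

    # Stage 1: randomized preconditioning. Sketch A down to s rows and take
    # the R factor of a QR decomposition, so that U = A R^{-1} is
    # well-conditioned with high probability.
    S = rng.standard_normal((s, n)) / np.sqrt(s)
    _, R = np.linalg.qr(S @ A)           # R is d x d, upper triangular
    U = np.linalg.solve(R.T, A.T).T      # solves U R = A without forming R^{-1}

    # Stage 2: importance sampling distribution from the squared row norms
    # of U (these approximate the leverage scores of A).
    row_norms_sq = np.einsum("ij,ij->i", U, U)
    total = row_norms_sq.sum()
    probs = row_norms_sq / total

    # Stage 3: weighted SGD on f(y) = sum_i (U_i y - b_i)^2, where y = R x.
    # Dividing by probs[i] keeps the stochastic gradient unbiased.
    y = np.zeros(d)
    for i in rng.choice(n, size=n_iters, p=probs):
        residual = U[i] @ y - b[i]
        grad = 2.0 * residual * U[i] / probs[i]
        y -= step / (2.0 * total) * grad

    # Map back to the original variable: x = R^{-1} y.
    return np.linalg.solve(R, y)
```

On a consistent system (where $b = Ax$ for some $x$) this sketch converges to the exact solution even when the columns of `A` are badly scaled, because the iteration runs on the preconditioned matrix `U` rather than on `A`. For noisy right-hand sides a constant step leaves the iterates in a neighborhood of the optimum, so a smaller or decaying `step` would be needed.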
