The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing
This addresses a bottleneck in optimization for machine learning by accelerating convergence in overparameterized settings, though it is incremental as it builds on existing preconditioning ideas.
The paper tackles the low-rank matrix sensing problem with unknown rank and ill-conditioned matrices by proposing ScaledGD(λ), a preconditioned gradient descent method, which converges to the true matrix at a constant linear rate with only logarithmic dependency on condition number and dimension, significantly improving over vanilla GD's polynomial dependency.
We propose $\textsf{ScaledGD($λ$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($λ$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($λ$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($λ$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.