Nys-Newton: Nyström-Approximated Curvature for Stochastic Optimization
This work addresses the computational bottleneck in second-order optimization for machine learning, offering an incremental improvement in efficiency for large-scale empirical risk minimization tasks.
The paper tackles the challenge of efficiently approximating Newton's method for large-scale stochastic optimization by proposing Nys-Newton, which uses a Nyström approximation on a partial Hessian to compute update steps without full matrix storage. Results show competitive performance with state-of-the-art methods on convex and non-convex functions, supported by theoretical convergence analysis for convex cases.
Second-order methods are among the most widely used approaches for convex optimization problems, and have recently been applied to non-convex problems such as training deep learning models. Widely used second-order methods such as quasi-Newton methods generally obtain curvature information by approximating the Hessian through the secant equation. However, the secant equation is a weak approximation of the Newton step because it relies only on first-order derivatives. In this study, we propose an approximate Newton sketch-based stochastic optimization algorithm for large-scale empirical risk minimization. Specifically, we compute a partial column Hessian of size $d\times m$ with $m\ll d$ randomly selected variables, then use the \emph{Nyström method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step $\Delta\boldsymbol{w}$ without computing and storing the full Hessian or its inverse. We then integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradient methods. Numerical experiments on both convex and non-convex functions show that the proposed approach obtains a better approximation of Newton's method and performs competitively with state-of-the-art first-order and stochastic quasi-Newton methods. Furthermore, we provide a theoretical convergence analysis for convex functions.
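The idea of computing the update step from a $d\times m$ column block alone can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `nystrom_newton_step`, the damping term `rho`, and the use of the Woodbury identity are assumptions made for illustration, and the test Hessian is a synthetic positive definite matrix. With $C$ the sampled columns and $W$ the $m\times m$ block of $C$ at the sampled rows, the Nyström approximation is $C W^{-1} C^\top$, and the damped system $(\rho I + C W^{-1} C^\top)\,\Delta\boldsymbol{w} = \boldsymbol{g}$ can be solved using only $m\times m$ linear algebra.

```python
import numpy as np

def nystrom_newton_step(hess_cols, idx, grad, rho=1.0):
    """Solve (rho*I + C W^{-1} C^T) step = grad without forming a d x d matrix.

    hess_cols : (d, m) array, the sampled Hessian columns C = H[:, idx]
    idx       : (m,) indices of the sampled variables
    grad      : (d,) gradient vector
    rho       : damping term (an assumption of this sketch, not from the abstract)
    """
    C = hess_cols
    W = C[idx, :]  # m x m block H[idx, idx]; assumed invertible here
    # Woodbury identity:
    # (rho*I + C W^{-1} C^T)^{-1} = I/rho - C (W + C^T C / rho)^{-1} C^T / rho^2
    inner = W + (C.T @ C) / rho                           # only an m x m system
    correction = C @ np.linalg.solve(inner, C.T @ grad)
    return grad / rho - correction / rho**2

# Synthetic example: a symmetric positive definite "Hessian" (hypothetical data).
rng = np.random.default_rng(0)
d, m = 50, 10
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)
g = rng.standard_normal(d)

idx = rng.choice(d, size=m, replace=False)  # m randomly selected variables
C = H[:, idx]                               # in practice only these columns are computed
step = nystrom_newton_step(C, idx, g, rho=1.0)
```

The full matrix `H` is built here only to generate test columns; the step itself touches nothing larger than the $d\times m$ block, so storage is $O(dm)$ and the dominant cost is forming $C^\top C$ at $O(dm^2)$.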