OCLGMLJan 18, 2016

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

arXiv:1601.04737v394 citations
AI Analysis

This work addresses computational efficiency for machine learning practitioners dealing with large datasets, but it is incremental as it builds on existing second-order optimization methods with sub-sampling techniques.

The paper tackles the problem of large-scale optimization in machine learning by analyzing variants of Newton's method that use uniform sub-sampling to estimate gradients and/or Hessians, providing non-asymptotic and quantitative convergence bounds that guarantee global convergence from any initial iterate.

Large scale optimization problems are ubiquitous in machine learning and data analysis and there is a plethora of algorithms for solving such problems. Many of these algorithms employ sub-sampling, as a way to either speed up the computations and/or to implicitly implement a form of statistical regularization. In this paper, we consider second-order iterative optimization algorithms and we provide bounds on the convergence of the variants of Newton's method that incorporate uniform sub-sampling as a means to estimate the gradient and/or Hessian. Our bounds are non-asymptotic and quantitative. Our algorithms are global and are guaranteed to converge from any initial iterate. Using random matrix concentration inequalities, one can sub-sample the Hessian to preserve the curvature information. Our first algorithm incorporates Hessian sub-sampling while using the full gradient. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or ridge-type regularization. Next, in addition to Hessian sub-sampling, we also consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra to obtain the proper sampling strategy. In all these algorithms, computing the update boils down to solving a large scale linear system, which can be computationally expensive. As a remedy, for all of our algorithms, we also give global convergence results for the case of inexact updates where such linear system is solved only approximately. This paper has a more advanced companion paper, [42], in which we demonstrate that, by doing a finer-grained analysis, we can get problem-independent bounds for local convergence of these algorithms and explore trade-offs to improve upon the basic results of the present paper.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes