Aaron Sidford

h-index16

55papers

3,025citations

Novelty61%

AI Score49

Ranked #49,724 of 201,326 authors (top 25%)#127 in DS (top 23%)

55 Papers

DSNov 5, 2015

A Faster Cutting Plane Method and its Implications for Combinatorial and Convex Optimization

Yin Tat Lee, Aaron Sidford, Sam Chiu-wai Wong

We improve upon the running time for finding a point in a convex set given a separation oracle. In particular, given a separation oracle for a convex set $K\subset \mathbb{R}^n$ contained in a box of radius $R$, we show how to either find a point in $K$ or prove that $K$ does not contain a ball of radius $ε$ using an expected $O(n\log(nR/ε))$ oracle evaluations and additional time $O(n^3\log^{O(1)}(nR/ε))$. This matches the oracle complexity and improves upon the $O(n^{ω+1}\log(nR/ε))$ additional time of the previous fastest algorithm achieved over 25 years ago by Vaidya for the current matrix multiplication constant $ω<2.373$ when $R/ε=n^{O(1)}$. Using a mix of standard reductions and new techniques, our algorithm yields improved runtimes for solving classic problems in continuous and combinatorial optimization: Submodular Minimization: Our weakly and strongly polynomial time algorithms have runtimes of $O(n^2\log nM\cdot\text{EO}+n^3\log^{O(1)}nM)$ and $O(n^3\log^2 n\cdot\text{EO}+n^4\log^{O(1)}n)$, improving upon the previous best of $O((n^4\text{EO}+n^5)\log M)$ and $O(n^5\text{EO}+n^6)$. Matroid Intersection: Our runtimes are $O(nrT_{\text{rank}}\log n\log (nM) +n^3\log^{O(1)}(nM))$ and $O(n^2\log (nM) T_{\text{ind}}+n^3 \log^{O(1)} (nM))$, achieving the first quadratic bound on the query complexity for the independence and rank oracles. In the unweighted case, this is the first improvement since 1986 for independence oracle. Submodular Flow: Our runtime is $O(n^2\log nCU\cdot\text{EO}+n^3\log^{O(1)}nCU)$, improving upon the previous bests from 15 years ago roughly by a factor of $O(n^4)$. Semidefinite Programming: Our runtime is $\tilde{O}(n(n^2+m^ω+S))$, improving upon the previous best of $\tilde{O}(n(n^ω+m^ω+S))$ for the regime where the number of nonzeros $S$ is small.

DSJan 28, 2013

A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time

Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford et al.

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for a such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a "nice" spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.

82.0DSJun 3

Randomization for Faster Exact Optimization of Discounted Markov Decision Processes

Andrei Graur, Aaron Sidford, Ta-Wei Tu

We provide faster deterministic and randomized algorithms for exactly solving discounted Markov Decision Processes (DMDPs). We obtain our results by efficiently reducing computing optimal values and policies in DMDPs to the easier tasks of policy evaluation and computing approximately optimal values in DMDPs. We provide both a straightforward deterministic reduction and a more efficient randomized variant that, together with advances in approximately solving DMDPs, yield our results.

DSMar 5, 2015

Path Finding I :Solving Linear Programs with Õ(sqrt(rank)) Linear System Solves

Yin Tat Lee, Aaron Sidford

In this paper we present a new algorithm for solving linear programs that requires only $\tilde{O}(\sqrt{rank(A)}L)$ iterations to solve a linear program with $m$ constraints, $n$ variables, and constraint matrix $A$, and bit complexity $L$. Each iteration of our method consists of solving $\tilde{O}(1)$ linear systems and additional nearly linear time computation. Our method improves upon the previous best iteration bound by factor of $\tildeΩ((m/rank(A))^{1/4})$ for methods with polynomial time computable iterations and by $\tildeΩ((m/rank(A))^{1/2})$ for methods which solve at most $\tilde{O}(1)$ linear systems in each iteration. Our method is parallelizable and amenable to linear algebraic techniques for accelerating the linear system solver. As such, up to polylogarithmic factors we either match or improve upon the best previous running times in both depth and work for different ratios of $m$ and $rank(A)$. Moreover, our method matches up to polylogarithmic factors a theoretical limit established by Nesterov and Nemirovski in 1994 regarding the use of a "universal barrier" for interior point methods, thereby resolving a long-standing open question regarding the running time of polynomial time interior point methods for linear programming.

DSMar 5, 2015

Path Finding II : An Õ(m sqrt(n)) Algorithm for the Minimum Cost Flow Problem

Yin Tat Lee, Aaron Sidford

In this paper we present an $\tilde{O}(m\sqrt{n}\log^{O(1)}U)$ time algorithm for solving the maximum flow problem on directed graphs with $m$ edges, $n$ vertices, and capacity ratio $U$. This improves upon the previous fastest running time of $O(m\min\left(n^{2/3},m^{1/2}\right)\log\left(n^{2}/m\right)\log U)$ achieved over 15 years ago by Goldberg and Rao. In the special case of solving dense directed unit capacity graphs our algorithm improves upon the previous fastest running times of of $O(\min\{m^{3/2},mn^{^{2/3}}\})$ achieved by Even and Tarjan and Karzanov over 35 years ago and of $\tilde{O}(m^{10/7})$ achieved recently by Mądry. We achieve these results through the development and application of a new general interior point method that we believe is of independent interest. The number of iterations required by this algorithm is better than that predicted by analyzing the best self-concordant barrier of the feasible region. By applying this method to the linear programming formulations of maximum flow, minimum cost flow, and lossy generalized minimum cost flow and applying analysis by Daitch and Spielman we achieve running time of $\tilde{O}(m\sqrt{n}\log^{O(1)}(U/ε))$ for these problems as well. Furthermore, our algorithm is parallelizable and using a recent nearly linear time work polylogarithmic depth Laplacian system solver of Spielman and Peng we achieve a $\tilde{O}(\sqrt{n}\log^{O(1)}(U/ε))$ depth algorithm and $\tilde{O}(m\sqrt{n}\log^{O(1)}(U/ε))$ work algorithm for solving these problems.

OCJan 1, 2023

ReSQueing Parallel and Private Stochastic Convex Optimization

Yair Carmon, Arun Jambulapati, Yujia Jin et al.

We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $ε_{\text{opt}}$ with $d^{1/3}ε_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}ε_{\text{opt}}^{-2/3} + ε_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $ε_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d ε_{\text{dp}}^{-2}$, $(ε_{\text{dp}}, δ)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 ε_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.

LGMar 29, 2022

Efficient Convex Optimization Requires Superlinear Memory

Annie Marsden, Vatsal Sharan, Aaron Sidford et al.

We show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - δ}$ bits of memory must make at least $\tildeΩ(d^{1 + (4/3)δ})$ first-order queries (for any constant $δ\in [0, 1/4]$). Consequently, the performance of such memory-constrained algorithms are a polynomial factor worse than the optimal $\tilde{O}(d)$ query bound for this problem obtained by cutting plane methods that use $\tilde{O}(d^2)$ memory. This resolves a COLT 2019 open problem of Woodworth and Srebro.

DSMar 8, 2022

Semi-Random Sparse Recovery in Nearly-Linear Time

Jonathan A. Kelner, Jerry Li, Allen Liu et al.

Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + ξ$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from prior fast iterative methods with provable guarantees under semi-random generative models: natural conditions on a submatrix which make sparse recovery tractable are NP-hard to verify. We design a new iterative method tailored to the geometry of sparse recovery which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for natural statistical inverse problems.

OCJun 17, 2022

RECAPP: Crafting a More Efficient Catalyst for Convex Optimization

Yair Carmon, Arun Jambulapati, Yujia Jin et al.

The accelerated proximal point algorithm (APPA), also known as "Catalyst", is a well-established reduction from convex optimization to approximate proximal point computation (i.e., regularized minimization). This reduction is conceptually elegant and yields strong convergence rate guarantees. However, these rates feature an extraneous logarithmic term arising from the need to compute each proximal point to high accuracy. In this work, we propose a novel Relaxed Error Criterion for Accelerated Proximal Point (RECAPP) that eliminates the need for high accuracy subproblem solutions. We apply RECAPP to two canonical problems: finite-sum and max-structured minimization. For finite-sum problems, we match the best known complexity, previously obtained by carefully-designed problem-specific algorithms. For minimizing $\max_y f(x,y)$ where $f$ is convex in $x$ and strongly-concave in $y$, we improve on the best known (Catalyst-based) bound by a logarithmic factor.

LGAug 7, 2023

Matrix Completion in Almost-Verification Time

Jonathan A. Kelner, Jerry Li, Allen Liu et al.

We give a new framework for solving the fundamental problem of low-rank matrix completion, i.e., approximating a rank-$r$ matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$ (where $m \ge n$) from random observations. First, we provide an algorithm which completes $\mathbf{M}$ on $99\%$ of rows and columns under no further assumptions on $\mathbf{M}$ from $\approx mr$ samples and using $\approx mr^2$ time. Then, assuming the row and column spans of $\mathbf{M}$ satisfy additional regularity properties, we show how to boost this partial completion guarantee to a full matrix completion algorithm by aggregating solutions to regression problems involving the observations. In the well-studied setting where $\mathbf{M}$ has incoherent row and column spans, our algorithms complete $\mathbf{M}$ to high precision from $mr^{2+o(1)}$ observations in $mr^{3 + o(1)}$ time (omitting logarithmic factors in problem parameters), improving upon the prior state-of-the-art [JN15] which used $\approx mr^5$ samples and $\approx mr^7$ time. Under an assumption on the row and column spans of $\mathbf{M}$ we introduce (which is satisfied by random subspaces with high probability), our sample complexity improves to an almost information-theoretically optimal $mr^{1 + o(1)}$, and our runtime improves to $mr^{2 + o(1)}$. Our runtimes have the appealing property of matching the best known runtime to verify that a rank-$r$ decomposition $\mathbf{U}\mathbf{V}^\top$ agrees with the sampled observations. We also provide robust variants of our algorithms that, given random observations from $\mathbf{M} + \mathbf{N}$ with $\|\mathbf{N}\|_{F} \le Δ$, complete $\mathbf{M}$ to Frobenius norm distance $\approx r^{1.5}Δ$ in the same runtimes as the noiseless setting. Prior noisy matrix completion algorithms [CP10] only guaranteed a distance of $\approx \sqrt{n}Δ$.

DSNov 17, 2023

A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions

Yair Carmon, Arun Jambulapati, Yujia Jin et al.

We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $ε$-approximate solution using $\widetilde{O}(n ε^{-1/3} + ε^{-2})$ gradient and function evaluations, and $\widetilde{O}(n ε^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear -- which corresponds to finding a near-optimal primal strategy in a matrix game -- our method finds an $ε$-approximate solution in runtime $\widetilde{O}(n (d/ε)^{2/3} + nd + dε^{-2})$. For $n>d$ and $ε=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = ω(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.

MLOct 13, 2022

On the Efficient Implementation of High Accuracy Optimality of Profile Maximum Likelihood

Moses Charikar, Zhihao Jiang, Kirankumar Shiragur et al.

We provide an efficient unified plug-in approach for estimating symmetric properties of distributions given $n$ independent samples. Our estimator is based on profile-maximum-likelihood (PML) and is sample optimal for estimating various symmetric properties when the estimation error $ε\gg n^{-1/3}$. This result improves upon the previous best accuracy threshold of $ε\gg n^{-1/4}$ achievable by polynomial time computable PML-based universal estimators [ACSS21, ACSS20]. Our estimator reaches a theoretical limit for universal symmetric property estimation as [Han21] shows that a broad class of universal estimators (containing many well known approaches including ours) cannot be sample optimal for every $1$-Lipschitz property when $ε\ll n^{-1/3}$.

DSOct 27, 2023

Structured Semidefinite Programming for Recovering Structured Preconditioners

Arun Jambulapati, Jerry Li, Christopher Musco et al.

We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $ε$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(κ^\star,ε^{-1}))$, where $κ^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $Ω(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $Ω(d^ω)$ where $ω> 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.

DSJul 2, 2023

Moments, Random Walks, and Limits for Spectrum Approximation

Yujia Jin, Christopher Musco, Aaron Sidford et al.

We study lower bounds for the problem of approximating a one dimensional distribution given (noisy) measurements of its moments. We show that there are distributions on $[-1,1]$ that cannot be approximated to accuracy $ε$ in Wasserstein-1 distance even if we know \emph{all} of their moments to multiplicative accuracy $(1\pm2^{-Ω(1/ε)})$; this result matches an upper bound of Kong and Valiant [Annals of Statistics, 2017]. To obtain our result, we provide a hard instance involving distributions induced by the eigenvalue spectra of carefully constructed graph adjacency matrices. Efficiently approximating such spectra in Wasserstein-1 distance is a well-studied algorithmic problem, and a recent result of Cohen-Steiner et al. [KDD 2018] gives a method based on accurately approximating spectral moments using $2^{O(1/ε)}$ random walks initiated at uniformly random nodes in the graph. As a strengthening of our main result, we show that improving the dependence on $1/ε$ in this result would require a new algorithmic approach. Specifically, no algorithm can compute an $ε$-accurate approximation to the spectrum of a normalized graph adjacency matrix with constant probability, even when given the transcript of $2^{Ω(1/ε)}$ random walks of length $2^{Ω(1/ε)}$ started at random nodes.

LGMay 21, 2024

Truncated Variance Reduced Value Iteration

Yujia Jin, Ishani Karmarkar, Aaron Sidford et al.

We provide faster randomized algorithms for computing an $ε$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$-state-action pairs, bounded rewards, and discount factor $γ$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - γ)^{-3}ε^{-2} + (1 - γ)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + (1-γ)^{-2})$-time algorithm in the offline setting where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state-of-the-art which either ran in $\tilde{O}(A_{\text{tot}}[(1 - γ)^{-3}ε^{-2} + (1 - γ)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, $\tilde{O}(s + A_{\text{tot}} (1-γ)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step in closing the sample-complexity gap between model-free and model-based methods.

DSJun 11, 2024

Faster Spectral Density Estimation and Sparsification in the Nuclear Norm

Yujia Jin, Ishani Karmarkar, Christopher Musco et al.

We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $n$-node undirected graph. We provide a randomized algorithm that, with $O(nε^{-2})$ queries to a degree and neighbor oracle and in $O(nε^{-3})$ time, estimates the spectrum up to $ε$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $O(nε^{-7})$ time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $ε$, a $2^{O(ε^{-1})}$ time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call nuclear sparsification. We provide an $O(nε^{-2})$-query and $O(nε^{-2})$-time algorithm for computing $O(nε^{-2})$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. Of independent interest, we show that our sparsification method also yields the first deterministic algorithm for spectral density estimation that scales linearly with $n$ (sublinear in the representation size of the graph).

OCJun 11, 2024

Closing the Computational-Query Depth Gap in Parallel Stochastic Convex Optimization

Arun Jambulapati, Aaron Sidford, Kevin Tian

We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a stochastic subgradient oracle. The total number of queries made and the query depth, i.e., the number of parallel rounds of queries, match the prior state-of-the-art, [CJJLLST23], while improving upon the computational depth by a polynomial factor for sufficiently small accuracy. When combined with previous state-of-the-art methods our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms. Our method starts with a ball acceleration framework of previous parallel methods, i.e., [CJJJLST20, ACJJS21], which reduce the problem to minimizing a regularized Gaussian convolution of the function constrained to Euclidean balls. By developing and leveraging new stability properties of the Hessian of this induced function, we depart from prior parallel algorithms and reduce these ball-constrained optimization problems to stochastic unconstrained quadratic minimization problems. Although we are unable to prove concentration of the asymmetric matrices that we use to approximate this Hessian, we nevertheless develop an efficient parallel method for solving these quadratics. Interestingly, our algorithms can be improved using fast matrix multiplication and use nearly-linear work if the matrix multiplication exponent is 2.

OCFeb 9, 2022

Sharper Rates for Separable Minimax and Finite Sum Optimization via Primal-Dual Extragradient Methods

Yujia Jin, Aaron Sidford, Kevin Tian

We design accelerated algorithms with improved rates for several fundamental classes of optimization problems. Our algorithms all build upon techniques related to the analysis of primal-dual extragradient methods via relative Lipschitzness proposed recently by [CST21]. (1) Separable minimax optimization. We study separable minimax optimization problems $\min_x \max_y f(x) - g(y) + h(x, y)$, where $f$ and $g$ have smoothness and strong convexity parameters $(L^x, μ^x)$, $(L^y, μ^y)$, and $h$ is convex-concave with a $(Λ^{xx}, Λ^{xy}, Λ^{yy})$-blockwise operator norm bounded Hessian. We provide an algorithm with gradient query complexity $\tilde{O}\left(\sqrt{\frac{L^{x}}{μ^{x}}} + \sqrt{\frac{L^{y}}{μ^{y}}} + \frac{Λ^{xx}}{μ^{x}} + \frac{Λ^{xy}}{\sqrt{μ^{x}μ^{y}}} + \frac{Λ^{yy}}{μ^{y}}\right)$. Notably, for convex-concave minimax problems with bilinear coupling (e.g.\ quadratics), where $Λ^{xx} = Λ^{yy} = 0$, our rate matches a lower bound of [ZHZ19]. (2) Finite sum optimization. We study finite sum optimization problems $\min_x \frac{1}{n}\sum_{i\in[n]} f_i(x)$, where each $f_i$ is $L_i$-smooth and the overall problem is $μ$-strongly convex. We provide an algorithm with gradient query complexity $\tilde{O}\left(n + \sum_{i\in[n]} \sqrt{\frac{L_i}{nμ}} \right)$. Notably, when the smoothness bounds $\{L_i\}_{i\in[n]}$ are non-uniform, our rate improves upon accelerated SVRG [LMH15, FGKS15] and Katyusha [All17] by up to a $\sqrt{n}$ factor. (3) Minimax finite sums. We generalize our algorithms for minimax and finite sum optimization to solve a natural family of minimax finite sum optimization problems at an accelerated rate, encapsulating both above results up to a logarithmic factor.

OCNov 4, 2021

Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

Jonathan Kelner, Annie Marsden, Vatsal Sharan et al.

We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grow as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive "Big-Step-Little-Step" interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number.

OCJun 17, 2021

Stochastic Bias-Reduced Gradient Methods

Hilal Asi, Yair Carmon, Arun Jambulapati et al.

We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ of any Lipschitz strongly-convex function. In particular, we use a multilevel Monte-Carlo approach due to Blanchet and Glynn to turn any optimal stochastic gradient method into an estimator of $x_\star$ with bias $δ$, variance $O(\log(1/δ))$, and an expected sampling cost of $O(\log(1/δ))$ stochastic gradient evaluations. As an immediate consequence, we obtain cheap and nearly unbiased gradient estimators for the Moreau-Yoshida envelope of any Lipschitz convex function, allowing us to perform dimension-free randomized smoothing. We demonstrate the potential of our estimator through four applications. First, we develop a method for minimizing the maximum of $N$ functions, improving on recent results and matching a lower bound up to logarithmic factors. Second and third, we recover state-of-the-art rates for projection-efficient and gradient-efficient optimization using simple algorithms with a transparent analysis. Finally, we show that an improved version of our estimator would yield a nearly linear-time, optimal-utility, differentially-private non-smooth stochastic optimization method.

LGJun 13, 2021

Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Yujia Jin, Aaron Sidford

We prove new upper and lower bounds for sample complexity of finding an $ε$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} ε^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.

OCMay 4, 2021

Thinking Inside the Ball: Near-Optimal Minimization of the Maximal Loss

Yair Carmon, Arun Jambulapati, Yujia Jin et al.

We characterize the complexity of minimizing $\max_{i\in[N]} f_i(x)$ for convex, Lipschitz functions $f_1,\ldots, f_N$. For non-smooth functions, existing methods require $O(Nε^{-2})$ queries to a first-order oracle to compute an $ε$-suboptimal point and $\tilde{O}(Nε^{-1})$ queries if the $f_i$ are $O(1/ε)$-smooth. We develop methods with improved complexity bounds of $\tilde{O}(Nε^{-2/3} + ε^{-8/3})$ in the non-smooth case and $\tilde{O}(Nε^{-2/3} + \sqrt{N}ε^{-1})$ in the $O(1/ε)$-smooth case. Our methods consist of a recently proposed ball optimization oracle acceleration algorithm (which we refine) and a careful implementation of said oracle for the softmax function. We also prove an oracle complexity lower bound scaling as $Ω(Nε^{-2/3})$, showing that our dependence on $N$ is optimal up to polylogarithmic factors.

OCNov 12, 2020

Relative Lipschitzness in Extragradient Methods and a Direct Recipe for Acceleration

Michael B. Cohen, Aaron Sidford, Kevin Tian

We show that standard extragradient methods (i.e. mirror prox and dual extrapolation) recover optimal accelerated rates for first-order minimization of smooth convex functions. To obtain this result we provide a fine-grained characterization of the convergence rates of extragradient methods for solving monotone variational inequalities in terms of a natural condition we call relative Lipschitzness. We further generalize this framework to handle local and randomized notions of relative Lipschitzness and thereby recover rates for box-constrained $\ell_\infty$ regression based on area convexity and complexity bounds achieved by accelerated (randomized) coordinate descent for smooth convex function minimization.

DSNov 5, 2020

Instance Based Approximations to Profile Maximum Likelihood

Nima Anari, Moses Charikar, Kirankumar Shiragur et al.

In this paper we provide a new efficient algorithm for approximately computing the profile maximum likelihood (PML) distribution, a prominent quantity in symmetric property estimation. We provide an algorithm which matches the previous best known efficient algorithms for computing approximate PML distributions and improves when the number of distinct observed frequencies in the given instance is small. We achieve this result by exploiting new sparsity structure in approximate PML distributions and providing a new matrix rounding algorithm, of independent interest. Leveraging this result, we obtain the first provable computationally efficient implementation of PseudoPML, a general framework for estimating a broad class of symmetric properties. Additionally, we obtain efficient PML-based estimators for distributions with small profile entropy, a natural instance-based complexity measure. Further, we provide a simpler and more practical PseudoPML implementation that matches the best-known theoretical guarantees of such an estimator and evaluate this method empirically.

OCOct 12, 2020

Large-Scale Methods for Distributionally Robust Optimization

Daniel Levy, Yair Carmon, John C. Duchi et al.

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $χ^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $χ^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $χ^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.

DSSep 17, 2020

Coordinate Methods for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford et al.

We develop primal-dual coordinate methods for solving bilinear saddle-point problems of the form $\min_{x \in \mathcal{X}} \max_{y\in\mathcal{Y}} y^\top A x$ which contain linear programming, classification, and regression as special cases. Our methods push existing fully stochastic sublinear methods and variance-reduced methods towards their limits in terms of per-iteration complexity and sample complexity. We obtain nearly-constant per-iteration complexity by designing efficient data structures leveraging Taylor approximations to the exponential and a binomial heap. We improve sample complexity via low-variance gradient estimators using dynamic sampling distributions that depend on both the iterates and the magnitude of the matrix entries. Our runtime bounds improve upon those of existing primal-dual methods by a factor depending on sparsity measures of the $m$ by $n$ matrix $A$. For example, when rows and columns have constant $\ell_1/\ell_2$ norm ratios, we offer improvements by a factor of $m+n$ in the fully stochastic setting and $\sqrt{m+n}$ in the variance-reduced setting. We apply our methods to computational geometry problems, i.e. minimum enclosing ball, maximum inscribed ball, and linear regression, and obtain improved complexity bounds. For linear regression with an elementwise nonnegative matrix, our guarantees improve on exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/(m+n)}$.

LGAug 28, 2020

Efficiently Solving MDPs with Stochastic Mirror Descent

Yujia Jin, Aaron Sidford

We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$ our method computes an $ε$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot} ε^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $γ$-discounted MDP with $A_{tot}$ total state-action pairs our method computes an $ε$-optimal policy with an expected $\widetilde{O}((1-γ)^{-4} A_{tot} ε^{-2})$ samples, matching the previous state-of-the-art up to a $(1-γ)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.

OCAug 4, 2020

Fast and Near-Optimal Diagonal Preconditioning

Arun Jambulapati, Jerry Li, Christopher Musco et al.

The convergence rates of iterative methods for solving a linear system $\mathbf{A} x = b$ typically depend on the condition number of the matrix $\mathbf{A}$. Preconditioning is a common way of speeding up these methods by reducing that condition number in a computationally inexpensive way. In this paper, we revisit the decades-old problem of how to best improve $\mathbf{A}$'s condition number by left or right diagonal rescaling. We make progress on this problem in several directions. First, we provide new bounds for the classic heuristic of scaling $\mathbf{A}$ by its diagonal values (a.k.a. Jacobi preconditioning). We prove that this approach reduces $\mathbf{A}$'s condition number to within a quadratic factor of the best possible scaling. Second, we give a solver for structured mixed packing and covering semidefinite programs (MPC SDPs) which computes a constant-factor optimal scaling for $\mathbf{A}$ in $\widetilde{O}(\text{nnz}(\mathbf{A}) \cdot \text{poly}(κ^\star))$ time; this matches the cost of solving the linear system after scaling up to a $\widetilde{O}(\text{poly}(κ^\star))$ factor. Third, we demonstrate that a sufficiently general width-independent MPC SDP solver would imply near-optimal runtimes for the scaling problems we consider, and natural variants concerned with measures of average conditioning. Finally, we highlight connections of our preconditioning techniques to semi-random noise models, as well as applications in reducing risk in several statistical regression models.

DSApr 6, 2020

The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum Likelihood

Nima Anari, Moses Charikar, Kirankumar Shiragur et al.

In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that given $n$ i.i.d.\ samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n} \log n) \right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3} \log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions in a broader range of error parameter. We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these are $\exp(O(k \log(N/k)))$ approximations to the permanent of $N \times N$ matrices with non-negative rank at most $k$, improving upon the previous known bounds of $\exp(O(N))$. To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with $\sqrt{n}$ distinct columns, i.e. with non-negative rank at most $\sqrt{n}$. As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations.

DSMar 2, 2020

A General Framework for Symmetric Property Estimation

Moses Charikar, Kirankumar Shiragur, Aaron Sidford

In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties we identify the easy region where empirical estimation works and the difficult region where more complex estimators are required. We show that by approximately computing the profile maximum likelihood (PML) distribution \cite{ADOS16} in this difficult region we obtain a symmetric property estimation framework that is sample complexity optimal for many properties in a broader parameter regime than previous universal estimation approaches based on PML. The resulting algorithms based on these pseudo PML distributions are also more practical.

DSOct 15, 2019

Principal Component Projection and Regression in Nearly Linear Time through Asymmetric SVRG

Yujia Jin, Aaron Sidford

Given a data matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$, principal component projection (PCP) and principal component regression (PCR), i.e. projection and regression restricted to the top-eigenspace of $\mathbf{A}$, are fundamental problems in machine learning, optimization, and numerical analysis. In this paper we provide the first algorithms that solve these problems in nearly linear time for fixed eigenvalue distribution and large n. This improves upon previous methods which have superlinear running times when both the number of top eigenvalues and inverse gap between eigenspaces is large. We achieve our results by applying rational approximations to reduce PCP and PCR to solving asymmetric linear systems which we solve by a variant of SVRG. We corroborate these findings with preliminary empirical experiments.

LGAug 29, 2019

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

Aaron Sidford, Mengdi Wang, Lin F. Yang et al.

In this paper, we settle the sampling complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors. Given a stochastic game with discount factor $γ\in(0,1)$ we provide an algorithm that computes an $ε$-optimal strategy with high-probability given $\tilde{O}((1 - γ)^{-3} ε^{-2})$ samples from the transition function for each state-action-pair. Our algorithm runs in time nearly linear in the number of samples and uses space nearly linear in the number of state-action pairs. As stochastic games generalize Markov decision processes (MDPs) our runtime and sample complexities are optimal due to Azar et al (2013). We achieve our results by showing how to generalize a near-optimal Q-learning based algorithms for MDP, in particular Sidford et al (2018), to two-player strategy computation algorithms. This overcomes limitations of standard Q-learning and strategy iteration or alternating minimization based approaches and we hope will pave the way for future reinforcement learning results by facilitating the extension of MDP results to multi-agent settings with little loss.

OCJul 3, 2019

Variance Reduction for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford et al.

We present a randomized primal-dual algorithm that solves the problem $\min_{x} \max_{y} y^\top A x$ to additive error $ε$ in time $\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)n}/ε$, for matrix $A$ with larger dimension $n$ and $\mathrm{nnz}(A)$ nonzero entries. This improves the best known exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/n}$ and is faster than fully stochastic gradient methods in the accurate and/or sparse regime $ε\le \sqrt{n/\mathrm{nnz}(A)}$. Our results hold for $x,y$ in the simplex (matrix games, linear programming) and for $x$ in an $\ell_2$ ball and $y$ in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines Nemirovski's "conceptual prox-method" and a novel reduced-variance gradient estimator based on "sampling from the difference" between the current iterate and a reference point.

OCJun 27, 2019

Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond

Oliver Hinder, Aaron Sidford, Nimit S. Sohoni

In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $γ\in (0,1]$, where $γ= 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $γ$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $ε$-approximate minimizer of a smooth $γ$-quasar-convex function with at most $O(γ^{-1} ε^{-1/2} \log(γ^{-1} ε^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $Ω(γ^{-1} ε^{-1/2})$ on the worst-case number of gradient evaluations required by any deterministic first-order method, showing that, up to a logarithmic factor, no deterministic first-order method can improve upon ours.

OCJun 25, 2019

Complexity of Highly Parallel Non-Smooth Convex Optimization

Sébastien Bubeck, Qijia Jiang, Yin Tat Lee et al.

A landmark result of non-smooth convex optimization is that gradient descent is an optimal algorithm whenever the number of computed gradients is smaller than the dimension $d$. In this paper we study the extension of this result to the parallel optimization setting. Namely we consider optimization algorithms interacting with a highly parallel gradient oracle, that is one that can answer $\mathrm{poly}(d)$ gradient queries in parallel. We show that in this case gradient descent is optimal only up to $\tilde{O}(\sqrt{d})$ rounds of interactions with the oracle. The lower bound improves upon a decades old construction by Nemirovski which proves optimality only up to $d^{1/3}$ rounds (as recently observed by Balkanski and Singer), and the suboptimality of gradient descent after $\sqrt{d}$ rounds was already observed by Duchi, Bartlett and Wainwright. In the latter regime we propose a new method with improved complexity, which we conjecture to be optimal. The analysis of this new method is based upon a generalized version of the recent results on optimal acceleration for highly smooth convex optimization.

DSJun 3, 2019

A Direct $\tilde{O}(1/ε)$ Iteration Parallel Algorithm for Optimal Transport

Arun Jambulapati, Aaron Sidford, Kevin Tian

Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $ε$ with $\tilde{O}(1/ε)$ parallel depth, and $\tilde{O}\left(n^2/ε\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem, this is optimal for first-order methods. Blanchet et. al. '18, Quanrud '19 obtained similar runtimes through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use complicated subroutines which may be deemed impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). The fastest practical algorithms run in time $\tilde{O}(\min(n^2 / ε^2, n^{2.5} / ε))$ (Dvurechensky et. al. '18, Lin et. al. '19). We bridge this gap by providing a parallel, first-order, $\tilde{O}(1/ε)$ iteration algorithm without worse dependence on dimension, and provide preliminary experimental evidence that our algorithm may enjoy improved practical performance. We obtain this runtime via a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow (Sherman '17).

DSMay 21, 2019

Efficient Profile Maximum Likelihood for Universal Symmetric Property Estimation

Moses Charikar, Kirankumar Shiragur, Aaron Sidford

Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, are among the most fundamental problems in algorithmic statistics. While each of these properties have been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. 2016 showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes \emph{profile likelihood (PML)}, i.e. the probability of observed frequency of frequencies, and returning the value of the property on this distribution is sample competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to this work there was no known polynomial time algorithm to compute an approximate PML and it was open to obtain a polynomial time universal plug-in estimator through the use of approximate PML. In this paper we provide a algorithm (in number of samples) that, given $n$ samples from a distribution, computes an approximate PML distribution up to a multiplicative error of $\exp(n^{2/3} \mathrm{poly} \log(n))$ in time nearly linear in $n$. Generalizing work of Acharya et al. 2016 on the utility of approximate PML we show that our algorithm provides a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy $ε= Ω(n^{-0.166})$. Further, we show how to extend our work to provide efficient polynomial-time algorithms for computing a $d$-dimensional generalization of PML (for constant $d$) that allows for universal plug-in estimation of symmetric relationships between distributions.

LGApr 18, 2019

Memory-Sample Tradeoffs for Linear Regression with Small Error

Vatsal Sharan, Aaron Sidford, Gregory Valiant

We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)\ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x\rangle + η_i,$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $η_i$ drawn uniformly from the interval $[-2^{-d/5},2^{-d/5}].$ We show that any algorithm with at most $d^2/4$ bits of memory requires at least $Ω(d \log \log \frac{1}ε)$ samples to approximate $x$ to $\ell_2$ error $ε$ with probability of success at least $2/3$, for $ε$ sufficiently small as a function of $d$. In contrast, for such $ε$, $x$ can be recovered to error $ε$ with probability $1-o(1)$ with memory $O\left(d^2 \log(1/ε)\right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.

LGMar 7, 2019

A Rank-1 Sketch for Matrix Multiplicative Weights

Yair Carmon, John C. Duchi, Aaron Sidford et al.

We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a $\textit{randomized mirror projection}$, and perform mirror descent analysis on the $\textit{expected projection}$. Our sketch solves the online eigenvector problem, improving the best known complexity bounds by $Ω(\log^5 n)$. We also apply this sketch to semidefinite programming in saddle-point form, yielding a simple primal-dual scheme with guarantees matching the best in the literature.

DSNov 27, 2018

Exploiting Numerical Sparsity for Efficient Learning : Faster Eigenvector Computation and Regression

Neha Gupta, Aaron Sidford

In this paper, we obtain improved running times for regression and top eigenvector computation for numerically sparse matrices. Given a data matrix $A \in \mathbb{R}^{n \times d}$ where every row $a \in \mathbb{R}^d$ has $\|a\|_2^2 \leq L$ and numerical sparsity at most $s$, i.e. $\|a\|_1^2 / \|a\|_2^2 \leq s$, we provide faster algorithms for these problems in many parameter settings. For top eigenvector computation, we obtain a running time of $\tilde{O}(nd + r(s + \sqrt{r s}) / \mathrm{gap}^2)$ where $\mathrm{gap} > 0$ is the relative gap between the top two eigenvectors of $A^\top A$ and $r$ is the stable rank of $A$. This running time improves upon the previous best unaccelerated running time of $O(nd + r d / \mathrm{gap}^2)$ as it is always the case that $r \leq d$ and $s \leq d$. For regression, we obtain a running time of $\tilde{O}(nd + (nL / μ) \sqrt{s nL / μ})$ where $μ> 0$ is the smallest eigenvalue of $A^\top A$. This running time improves upon the previous best unaccelerated running time of $\tilde{O}(nd + n L d / μ)$. This result expands the regimes where regression can be solved in nearly linear time from when $L/μ= \tilde{O}(1)$ to when $L / μ= \tilde{O}(d^{2/3} / (sn)^{1/3})$. Furthermore, we obtain similar improvements even when row norms and numerical sparsities are non-uniform and we show how to achieve even faster running times by accelerating using approximate proximal point [Frostig et. al. 2015] / catalyst [Lin et. al. 2015]. Our running times depend only on the size of the input and natural numerical measures of the matrix, i.e. eigenvalues and $\ell_p$ norms, making progress on a key open problem regarding optimal running times for efficient large-scale learning.

MLNov 22, 2017

Leverage Score Sampling for Faster Accelerated Regression and ERM

Naman Agarwal, Sham Kakade, Rahul Kidambi et al.

Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ and a vector $b \in\mathbb{R}^{d}$, we show how to compute an $ε$-approximate solution to the regression problem $ \min_{x\in\mathbb{R}^{d}}\frac{1}{2} \|\mathbf{A} x - b\|_{2}^{2} $ in time $ \tilde{O} ((n+\sqrt{d\cdotκ_{\text{sum}}})\cdot s\cdot\logε^{-1}) $ where $κ_{\text{sum}}=\mathrm{tr}\left(\mathbf{A}^{\top}\mathbf{A}\right)/λ_{\min}(\mathbf{A}^{T}\mathbf{A})$ and $s$ is the maximum number of non-zero entries in a row of $\mathbf{A}$. Our algorithm improves upon the previous best running time of $ \tilde{O} ((n+\sqrt{n \cdotκ_{\text{sum}}})\cdot s\cdot\logε^{-1})$. We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.

DSOct 27, 2017

Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes

Aaron Sidford, Mengdi Wang, Xian Wu et al.

In this paper we provide faster algorithms for approximately solving discounted Markov Decision Processes in multiple parameter regimes. Given a discounted Markov Decision Process (DMDP) with $|S|$ states, $|A|$ actions, discount factor $γ\in(0,1)$, and rewards in the range $[-M, M]$, we show how to compute an $ε$-optimal policy, with probability $1 - δ$ in time \[ \tilde{O}\left( \left(|S|^2 |A| + \frac{|S| |A|}{(1 - γ)^3} \right) \log\left( \frac{M}ε \right) \log\left( \frac{1}δ \right) \right) ~ . \] This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of $γ$. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in $O(1)$ time. Under this assumption we provide an algorithm which computes an $ε$-optimal policy with probability $1 - δ$ in time \[ \tilde{O} \left(\frac{|S| |A| M^2}{(1 - γ)^4 ε^2} \log \left(\frac{1}δ\right) \right) ~. \] Lastly, we extend both these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of few instances in using sampling to obtain a linearly convergent linear programming algorithm and we hope that the analysis may be useful more broadly.

MLOct 25, 2017

A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

Prateek Jain, Sham M. Kakade, Rahul Kidambi et al.

This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.

MLApr 26, 2017

Accelerating Stochastic Gradient Descent For Least Squares Regression

Prateek Jain, Sham M. Kakade, Rahul Kidambi et al.

There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.

DSApr 13, 2017

Spectrum Approximation Beyond Fast Matrix Multiplication: Algorithms and Hardness

Cameron Musco, Praneeth Netrapalli, Aaron Sidford et al.

Understanding the singular value spectrum of a matrix $A \in \mathbb{R}^{n \times n}$ is a fundamental task in countless applications. In matrix multiplication time, it is possible to perform a full SVD and directly compute the singular values $σ_1,...,σ_n$. However, little is known about algorithms that break this runtime barrier. Using tools from stochastic trace estimation, polynomial approximation, and fast system solvers, we show how to efficiently isolate different ranges of $A$'s spectrum and approximate the number of singular values in these ranges. We thus effectively compute a histogram of the spectrum, which can stand in for the true singular values in many applications. We use this primitive to give the first algorithms for approximating a wide class of symmetric matrix norms in faster than matrix multiplication time. For example, we give a $(1 + ε)$ approximation algorithm for the Schatten-$1$ norm (the nuclear norm) running in just $\tilde O((nnz(A)n^{1/3} + n^2)ε^{-3})$ time for $A$ with uniform row sparsity or $\tilde O(n^{2.18} ε^{-3})$ time for dense matrices. The runtime scales smoothly for general Schatten-$p$ norms, notably becoming $\tilde O (p \cdot nnz(A) ε^{-3})$ for any $p \ge 2$. At the same time, we show that the complexity of spectrum approximation is inherently tied to fast matrix multiplication in the small $ε$ regime. We prove that achieving milder $ε$ dependencies in our algorithms would imply faster than matrix multiplication time triangle detection for general graphs. This further implies that highly accurate algorithms running in subcubic time yield subcubic time matrix multiplication. As an application of our bounds, we show that precisely computing all effective resistances in a graph in less than matrix multiplication time is likely difficult, barring a major algorithmic breakthrough.

MLOct 12, 2016

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification

Prateek Jain, Sham M. Kakade, Rahul Kidambi et al.

This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batch SGD yields provable near-linear parallelization speedups over SGD with batch size one. This allows for understanding learning rate versus batch size tradeoffs for the final iterate of an SGD method. These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD methods. A non-asymptotic analysis of communication efficient parallelization schemes such as model-averaging/parameter mixing methods is then provided. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure minimax risk for the agnostic case must be a function of the noise properties. This paper builds on the operator view of analyzing SGD methods, introduced by Defossez and Bach (2015), followed by developing a novel analysis in bounding these operators to characterize the excess risk. These techniques are of broader interest in analyzing computational aspects of stochastic approximation.

DSMay 26, 2016

Faster Eigenvector Computation via Shift-and-Invert Preconditioning

Dan Garber, Elad Hazan, Chi Jin et al.

We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix $Σ$ -- i.e. computing a unit vector $x$ such that $x^T Σx \ge (1-ε)λ_1(Σ)$: Offline Eigenvector Estimation: Given an explicit $A \in \mathbb{R}^{n \times d}$ with $Σ= A^TA$, we show how to compute an $ε$ approximate top eigenvector in time $\tilde O([nnz(A) + \frac{d*sr(A)}{gap^2} ]* \log 1/ε)$ and $\tilde O([\frac{nnz(A)^{3/4} (d*sr(A))^{1/4}}{\sqrt{gap}} ] * \log 1/ε)$. Here $nnz(A)$ is the number of nonzeros in $A$, $sr(A)$ is the stable rank, $gap$ is the relative eigengap. By separating the $gap$ dependence from the $nnz(A)$ term, our first runtime improves upon the classical power and Lanczos methods. It also improves prior work using fast subspace embeddings [AC09, CW13] and stochastic optimization [Sha15c], giving significantly better dependencies on $sr(A)$ and $ε$. Our second running time improves these further when $nnz(A) \le \frac{d*sr(A)}{gap^2}$. Online Eigenvector Estimation: Given a distribution $D$ with covariance matrix $Σ$ and a vector $x_0$ which is an $O(gap)$ approximate top eigenvector for $Σ$, we show how to refine to an $ε$ approximation using $ O(\frac{var(D)}{gap*ε})$ samples from $D$. Here $var(D)$ is a natural notion of variance. Combining our algorithm with previous work to initialize $x_0$, we obtain improved sample complexity and runtime results under a variety of assumptions on $D$. We achieve our results using a general framework that we believe is of independent interest. We give a robust analysis of the classic method of shift-and-invert preconditioning to reduce eigenvector computation to approximately solving a sequence of linear systems. We then apply fast stochastic variance reduced gradient (SVRG) based system solvers to achieve our claims.

LGApr 13, 2016

Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

Rong Ge, Chi Jin, Sham M. Kakade et al.

This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-$k$ generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of $O(\frac{z k \sqrtκ}ρ \log(1/ε) \log \left(kκ/ρ\right))$ where $z$ is the total number of nonzero entries, $κ$ is the condition number and $ρ$ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

LGFeb 22, 2016

Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm

Prateek Jain, Chi Jin, Sham M. Kakade et al.

This work provides improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = Σ$ for $Σ\succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $Σ$. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance $\frac{1}{n} \sum_{i \in [n]} A_i$ as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of $O(d)$ or $1/\mathrm{gap}$ where $\mathrm{gap}$ is the relative distance between the top two eigenvalues of $Σ$. These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point $w_0$ and applying the update rule $w_{i + 1} = w_i + η_i A_i w_i$ suffices to accurately estimate the top eigenvector, with a suitable choice of $η_i$. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.

DSFeb 22, 2016

Principal Component Projection Without Principal Component Analysis

Roy Frostig, Cameron Musco, Christopher Musco et al.

We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a "smooth projection" onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.