OCJan 1, 2023
ReSQueing Parallel and Private Stochastic Convex OptimizationYair Carmon, Arun Jambulapati, Yujia Jin et al.
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $ε_{\text{opt}}$ with $d^{1/3}ε_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}ε_{\text{opt}}^{-2/3} + ε_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $ε_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d ε_{\text{dp}}^{-2}$, $(ε_{\text{dp}}, δ)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 ε_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.
LGJul 18, 2022
Private Convex Optimization in General NormsSivakanth Gopi, Yin Tat Lee, Daogao Liu et al.
We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $ε\to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
DSFeb 13, 2023
Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal SamplerSivakanth Gopi, Yin Tat Lee, Daogao Liu et al.
The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.
DSMar 8, 2022
Semi-Random Sparse Recovery in Nearly-Linear TimeJonathan A. Kelner, Jerry Li, Allen Liu et al.
Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + ξ$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from prior fast iterative methods with provable guarantees under semi-random generative models: natural conditions on a submatrix which make sparse recovery tractable are NP-hard to verify. We design a new iterative method tailored to the geometry of sparse recovery which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for natural statistical inverse problems.
LGAug 7, 2023
Matrix Completion in Almost-Verification TimeJonathan A. Kelner, Jerry Li, Allen Liu et al.
We give a new framework for solving the fundamental problem of low-rank matrix completion, i.e., approximating a rank-$r$ matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$ (where $m \ge n$) from random observations. First, we provide an algorithm which completes $\mathbf{M}$ on $99\%$ of rows and columns under no further assumptions on $\mathbf{M}$ from $\approx mr$ samples and using $\approx mr^2$ time. Then, assuming the row and column spans of $\mathbf{M}$ satisfy additional regularity properties, we show how to boost this partial completion guarantee to a full matrix completion algorithm by aggregating solutions to regression problems involving the observations. In the well-studied setting where $\mathbf{M}$ has incoherent row and column spans, our algorithms complete $\mathbf{M}$ to high precision from $mr^{2+o(1)}$ observations in $mr^{3 + o(1)}$ time (omitting logarithmic factors in problem parameters), improving upon the prior state-of-the-art [JN15] which used $\approx mr^5$ samples and $\approx mr^7$ time. Under an assumption on the row and column spans of $\mathbf{M}$ we introduce (which is satisfied by random subspaces with high probability), our sample complexity improves to an almost information-theoretically optimal $mr^{1 + o(1)}$, and our runtime improves to $mr^{2 + o(1)}$. Our runtimes have the appealing property of matching the best known runtime to verify that a rank-$r$ decomposition $\mathbf{U}\mathbf{V}^\top$ agrees with the sampled observations. We also provide robust variants of our algorithms that, given random observations from $\mathbf{M} + \mathbf{N}$ with $\|\mathbf{N}\|_{F} \le Δ$, complete $\mathbf{M}$ to Frobenius norm distance $\approx r^{1.5}Δ$ in the same runtimes as the noiseless setting. Prior noisy matrix completion algorithms [CP10] only guaranteed a distance of $\approx \sqrt{n}Δ$.
DSOct 27, 2023
Structured Semidefinite Programming for Recovering Structured PreconditionersArun Jambulapati, Jerry Li, Christopher Musco et al.
We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $ε$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(κ^\star,ε^{-1}))$, where $κ^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $Ω(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $Ω(d^ω)$ where $ω> 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.
MLMar 3
Combinatorial Sparse PCA Beyond the Spiked Identity ModelSyamantak Kumar, Purnamrita Sarkar, Kevin Tian et al.
Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $Σ$, whose top eigenvector $v \in R^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $Σ= I_d + γvv^\top$ for some $γ> 0$), whereas SDP-based algorithms require no additional assumptions on $Σ$. We demonstrate explicit counterexample covariances $Σ$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $Σ$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
CLFeb 21, 2018Code
CoVeR: Learning Covariate-Specific Vector Representations with Tensor DecompositionsKevin Tian, Teng Zhang, James Zou
Word embedding is a useful approach to capture co-occurrence structures in large text corpora. However, in addition to the text data itself, we often have additional covariates associated with individual corpus documents---e.g. the demographic of the author, time and venue of publication---and we would like the embedding to naturally capture this information. We propose CoVeR, a new tensor decomposition model for vector embeddings with covariates. CoVeR jointly learns a \emph{base} embedding for all the words as well as a weighted diagonal matrix to model how each covariate affects the base embedding. To obtain author or venue-specific embedding, for example, we can then simply multiply the base embedding by the associated transformation matrix. The main advantages of our approach are data efficiency and interpretability of the covariate transformation. Our experiments demonstrate that our joint model learns substantially better covariate-specific embeddings compared to the standard approach of learning a separate embedding for each covariate using only the relevant subset of data, as well as other related methods. Furthermore, CoVeR encourages the embeddings to be "topic-aligned" in that the dimensions have specific independent meanings. This allows our covariate-specific embeddings to be compared by topic, enabling downstream differential analysis. We empirically evaluate the benefits of our algorithm on datasets, and demonstrate how it can be used to address many natural questions about covariate effects. Accompanying code to this paper can be found at http://github.com/kjtian/CoVeR.
DSFeb 19
Simultaneous Blackwell Approachability and Applications to Multiclass OmnipredictionLunjia Hu, Kevin Tian, Chutong Yang
Omniprediction is a learning problem that requires suboptimality bounds for each of a family of losses $\mathcal{L}$ against a family of comparator predictors $\mathcal{C}$. We initiate the study of omniprediction in a multiclass setting, where the comparator family $\mathcal{C}$ may be infinite. Our main result is an extension of the recent binary omniprediction algorithm of [OKK25] to the multiclass setting, with sample complexity (in statistical settings) or regret horizon (in online settings) $\approx \varepsilon^{-(k+1)}$, for $\varepsilon$-omniprediction in a $k$-class prediction problem. En route to proving this result, we design a framework of potential broader interest for solving Blackwell approachability problems where multiple sets must simultaneously be approached via coupled actions.
STFeb 18
Separating Oblivious and Adaptive Models of Variable SelectionZiyun Chen, Jerry Li, Kevin Tian et al.
Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees. This variant of the problem is motivated by \emph{variable selection} tasks, where the goal is to estimate the support of a $k$-sparse signal in $\mathbb{R}^d$. Our main contribution is a provable separation between the \emph{oblivious} (``for each'') and \emph{adaptive} (``for all'') models of $\ell_\infty$ sparse recovery. We show that under an oblivious model, the optimal $\ell_\infty$ error is attainable in near-linear time with $\approx k\log d$ samples, whereas in an adaptive model, $\gtrsim k^2$ samples are necessary for any algorithm to achieve this bound. This establishes a surprising contrast with the standard $\ell_2$ setting, where $\approx k \log d$ samples suffice even for adaptive sparse recovery. We conclude with a preliminary examination of a \emph{partially-adaptive} model, where we show nontrivial variable selection guarantees are possible with $\approx k\log d$ measurements.
LGFeb 20, 2024
Testing Calibration in Nearly-Linear TimeLunjia Hu, Arun Jambulapati, Kevin Tian et al.
In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on $(predictions, binary outcomes)$, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $Ω(n^ω)$ time, where $ω> 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
LGJan 16, 2025
Learning Noisy Halfspaces with a Margin: Massart is No Harder than RandomGautam Chandrasekaran, Vasilis Kontonis, Konstantinos Stavropoulos et al.
We study the problem of PAC learning $γ$-margin halfspaces with Massart noise. We propose a simple proper learning algorithm, the Perspectron, that has sample complexity $\widetilde{O}((εγ)^{-2})$ and achieves classification error at most $η+ε$ where $η$ is the Massart noise rate. Prior works [DGT19,CKMY20] came with worse sample complexity guarantees (in both $ε$ and $γ$) or could only handle random classification noise [DDK+23,KIT+23] -- a much milder noise assumption. We also show that our results extend to the more challenging setting of learning generalized linear models with a known link function under Massart noise, achieving a similar sample complexity to the halfspace case. This significantly improves upon the prior state-of-the-art in this setting due to [CKMY20], who introduced this model.
NAMar 6, 2024
Black-Box $k$-to-$1$-PCA Reductions: Theory and ApplicationsArun Jambulapati, Syamantak Kumar, Jerry Li et al. · cmu
The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have black-box access to via samples. Motivated by these settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work in sample complexity by a $\mathsf{poly}(k)$ factor.
LGNov 20, 2024
Omnipredicting Single-Index Models with Multi-Index ModelsLunjia Hu, Kevin Tian, Chutong Yang
Recent work on supervised learning [GKR+22] defined the notion of omnipredictors, i.e., predictor functions $p$ over features that are simultaneously competitive for minimizing a family of loss functions $\mathcal{L}$ against a comparator class $\mathcal{C}$. Omniprediction requires approximating the Bayes-optimal predictor beyond the loss minimization paradigm, and has generated significant interest in the learning theory community. However, even for basic settings such as agnostically learning single-index models (SIMs), existing omnipredictor constructions require impractically-large sample complexities and runtimes, and output complex, highly-improper hypotheses. Our main contribution is a new, simple construction of omnipredictors for SIMs. We give a learner outputting an omnipredictor that is $\varepsilon$-competitive on any matching loss induced by a monotone, Lipschitz link function, when the comparator class is bounded linear predictors. Our algorithm requires $\approx \varepsilon^{-4}$ samples and runs in nearly-linear time, and its sample complexity improves to $\approx \varepsilon^{-2}$ if link functions are bi-Lipschitz. This significantly improves upon the only prior known construction, due to [HJKRR18, GHK+23], which used $\gtrsim \varepsilon^{-10}$ samples. We achieve our construction via a new, sharp analysis of the classical Isotron algorithm [KS09, KKKS11] in the challenging agnostic learning setting, of potential independent interest. Previously, Isotron was known to properly learn SIMs in the realizable setting, as well as constant-factor competitive hypotheses under the squared loss [ZWDD24]. As they are based on Isotron, our omnipredictors are multi-index models with $\approx \varepsilon^{-2}$ prediction heads, bringing us closer to the tantalizing goal of proper omniprediction for general loss families and comparators.
MLMar 4, 2025
Spike-and-Slab Posterior Sampling in High DimensionsSyamantak Kumar, Purnamrita Sarkar, Kevin Tian et al.
Posterior sampling with the spike-and-slab prior [MB88], a popular multimodal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression [CPS09, Roc18]. However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) [YWJ16], only work when the measurement count grows at least linearly in the dimension [MW24], or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $\mathbf{X} \in \mathbb{R}^{n\times d}$ and noisy observations $\mathbf{y} = \mathbf{X}\mathbfθ^\star + \mathbfξ$ of a signal $\mathbfθ^\star$ drawn from a spike-and-slab prior $π$ with a Gaussian diffuse density and expected sparsity k, where $\mathbfξ \sim \mathcal{N}(\mathbb{0}_n, σ^2\mathbf{I}_n)$. We give a polynomial-time high-accuracy sampler for the posterior $π(\cdot \mid \mathbf{X}, \mathbf{y})$, for any SNR $σ^{-1}$ > 0, as long as $n \geq k^3 \cdot \text{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\approx nd$ in the same setting, as long as $n \geq k^5 \cdot \text{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $σ= O(\frac{1}{k})$ is bounded.
CLMar 1, 2025
More of the Same: Persistent Representational Harms Under Increased RepresentationJennifer Mickel, Maria De-Arteaga, Leqi Liu et al.
To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, even representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
PRFeb 3
Functional Stochastic LocalizationAnming Gu, Bobby Shi, Kevin Tian
Eldan's stochastic localization is a probabilistic construction that has proved instrumental to modern breakthroughs in high-dimensional geometry and the design of sampling algorithms. Motivated by sampling under non-Euclidean geometries and the mirror descent algorithm in optimization, we develop a functional generalization of Eldan's process that replaces Gaussian regularization with regularization by any positive integer multiple of a log-Laplace transform. We further give a mixing time bound on the Markov chain induced by our localization process, which holds if our target distribution satisfies a functional Poincaré inequality. Finally, we apply our framework to differentially private convex optimization in $\ell_p$ norms for $p \in [1, 2)$, where we improve state-of-the-art query complexities in a zeroth-order model.
PROct 6, 2025
Perspectives on Stochastic LocalizationBobby Shi, Kevin Tian, Matthew S. Zhang
We survey different perspectives on the stochastic localization process of [Eld13], a powerful construction that has had many exciting recent applications in high-dimensional probability and algorithm design. Unlike prior surveys on this topic, our focus is on giving a self-contained presentation of all known alternative constructions of Eldan's stochastic localization, with an emphasis on connections between different constructions. Our hope is that by collecting these perspectives, some of which had primarily arisen within a particular community (e.g., probability theory, theoretical computer science, information theory, or machine learning), we can broaden the accessibility of stochastic localization, and ease its future use.
OCJun 11, 2024
Closing the Computational-Query Depth Gap in Parallel Stochastic Convex OptimizationArun Jambulapati, Aaron Sidford, Kevin Tian
We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a stochastic subgradient oracle. The total number of queries made and the query depth, i.e., the number of parallel rounds of queries, match the prior state-of-the-art, [CJJLLST23], while improving upon the computational depth by a polynomial factor for sufficiently small accuracy. When combined with previous state-of-the-art methods our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms. Our method starts with a ball acceleration framework of previous parallel methods, i.e., [CJJJLST20, ACJJS21], which reduce the problem to minimizing a regularized Gaussian convolution of the function constrained to Euclidean balls. By developing and leveraging new stability properties of the Hessian of this induced function, we depart from prior parallel algorithms and reduce these ball-constrained optimization problems to stochastic unconstrained quadratic minimization problems. Although we are unable to prove concentration of the asymmetric matrices that we use to approximate this Hessian, we nevertheless develop an efficient parallel method for solving these quadratics. Interestingly, our algorithms can be improved using fast matrix multiplication and use nearly-linear work if the matrix multiplication exponent is 2.
DSJun 4, 2024
Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple ReductionsHilal Asi, Daogao Liu, Kevin Tian
We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a $k^{\text{th}}$-moment bound on the Lipschitz constants of sample functions rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving error $G_2 \cdot \frac 1 {\sqrt n} + G_k \cdot (\frac{\sqrt d}{nε})^{1 - \frac 1 k}$ under $(ε, δ)$-approximate differential privacy, up to a mild $\textup{polylog}(\frac{1}δ)$ factor, where $G_2^2$ and $G_k^k$ are the $2^{\text{nd}}$ and $k^{\text{th}}$ moment bounds on sample Lipschitz constants, nearly-matching a lower bound of [Lowy and Razaviyayn 2023]. We further give a suite of private algorithms in the heavy-tailed setting which improve upon our basic result under additional assumptions, including an optimal algorithm under a known-Lipschitz constant assumption, a near-linear time algorithm for smooth functions, and an optimal linear time algorithm for smooth generalized linear models.
OCFeb 9, 2022
Sharper Rates for Separable Minimax and Finite Sum Optimization via Primal-Dual Extragradient MethodsYujia Jin, Aaron Sidford, Kevin Tian
We design accelerated algorithms with improved rates for several fundamental classes of optimization problems. Our algorithms all build upon techniques related to the analysis of primal-dual extragradient methods via relative Lipschitzness proposed recently by [CST21]. (1) Separable minimax optimization. We study separable minimax optimization problems $\min_x \max_y f(x) - g(y) + h(x, y)$, where $f$ and $g$ have smoothness and strong convexity parameters $(L^x, μ^x)$, $(L^y, μ^y)$, and $h$ is convex-concave with a $(Λ^{xx}, Λ^{xy}, Λ^{yy})$-blockwise operator norm bounded Hessian. We provide an algorithm with gradient query complexity $\tilde{O}\left(\sqrt{\frac{L^{x}}{μ^{x}}} + \sqrt{\frac{L^{y}}{μ^{y}}} + \frac{Λ^{xx}}{μ^{x}} + \frac{Λ^{xy}}{\sqrt{μ^{x}μ^{y}}} + \frac{Λ^{yy}}{μ^{y}}\right)$. Notably, for convex-concave minimax problems with bilinear coupling (e.g.\ quadratics), where $Λ^{xx} = Λ^{yy} = 0$, our rate matches a lower bound of [ZHZ19]. (2) Finite sum optimization. We study finite sum optimization problems $\min_x \frac{1}{n}\sum_{i\in[n]} f_i(x)$, where each $f_i$ is $L_i$-smooth and the overall problem is $μ$-strongly convex. We provide an algorithm with gradient query complexity $\tilde{O}\left(n + \sum_{i\in[n]} \sqrt{\frac{L_i}{nμ}} \right)$. Notably, when the smoothness bounds $\{L_i\}_{i\in[n]}$ are non-uniform, our rate improves upon accelerated SVRG [LMH15, FGKS15] and Katyusha [All17] by up to a $\sqrt{n}$ factor. (3) Minimax finite sums. We generalize our algorithms for minimax and finite sum optimization to solve a natural family of minimax finite sum optimization problems at an accelerated rate, encapsulating both above results up to a logarithmic factor.
DSJun 22, 2021
Robust Regression Revisited: Acceleration and Improved Estimation RatesArun Jambulapati, Jerry Li, Tselil Schramm et al.
We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the robust gradient descent framework of Prasad et. al., a first-order method using biased gradient queries, or the Sever framework of Diakonikolas et. al., an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art. For the general case of smooth GLMs (e.g. logistic regression), we show that the robust gradient descent framework of Prasad et. al. can be accelerated, and show our algorithm extends to optimizing the Moreau envelopes of Lipschitz GLMs (e.g. support vector machines), answering several open questions in the literature. For the well-studied case of robust linear regression, we present an alternative approach obtaining improved estimation rates over prior nearly-linear time algorithms. Interestingly, our method starts with an identifiability proof introduced in the context of the sum-of-squares algorithm of Bakshi and Prasad, which achieved optimal error rates while requiring large polynomial runtime and sample complexity. We reinterpret their proof within the Sever framework and obtain a dramatically faster and more sample-efficient algorithm under fewer distributional assumptions.
DSJun 16, 2021
Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean EstimationIlias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard et al.
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< α<\frac 1 2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-α)$-fraction are arbitrary. The goal is to output a small list of vectors, at least one of which is close to the mean of $\mathcal{D}$. We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $O(n^{1 + ε_0} d)$, for any fixed $ε_0 > 0$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 α$. We leverage this result, together with additional techniques, to obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $Ω(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $α\to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms.
DSJun 10, 2021
Lower Bounds on Metropolized Sampling Methods for Well-Conditioned DistributionsYin Tat Lee, Ruoqi Shen, Kevin Tian
We give lower bounds on the performance of two of the most popular sampling methods in practice, the Metropolis-adjusted Langevin algorithm (MALA) and multi-step Hamiltonian Monte Carlo (HMC) with a leapfrog integrator, when applied to well-conditioned distributions. Our main result is a nearly-tight lower bound of $\widetildeΩ(κd)$ on the mixing time of MALA from an exponentially warm start, matching a line of algorithmic results up to logarithmic factors and answering an open question of Chewi et. al. We also show that a polynomial dependence on dimension is necessary for the relaxation time of HMC under any number of leapfrog steps, and bound the gains achievable by changing the step count. Our HMC analysis draws upon a novel connection between leapfrog integration and Chebyshev polynomials, which may be of independent interest.
DSNov 19, 2020
List-Decodable Mean Estimation in Nearly-PCA TimeIlias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard et al.
Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust list-decodable learning focuses on the more challenging regime where only a minority $\frac 1 k$ fraction of the dataset is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new list-decodable mean estimation algorithm for bounded covariance distributions with optimal sample complexity and error rate, running in nearly-PCA time. Assuming the ground truth distribution on $\mathbb{R}^d$ has bounded covariance, our algorithm outputs a list of $O(k)$ candidate means, one of which is within distance $O(\sqrt{k})$ from the truth. Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup Ω(d)$, where $n$ is the size of the dataset. We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee. This runtime matches up to logarithmic factors the cost of performing a single $k$-PCA on the data, which is a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest list-decodable mean estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$ and $\widetilde{O}(nd k^{\ge 6})$. Our approach builds on a novel soft downweighting method, $\mathsf{SIFT}$, which is arguably the simplest known polynomial-time mean estimation technique in the list-decodable learning setting. To develop our fast algorithms, we boost the computational cost of $\mathsf{SIFT}$ via a careful "win-win-win" analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which we believe may be of independent interest.
OCNov 12, 2020
Relative Lipschitzness in Extragradient Methods and a Direct Recipe for AccelerationMichael B. Cohen, Aaron Sidford, Kevin Tian
We show that standard extragradient methods (i.e. mirror prox and dual extrapolation) recover optimal accelerated rates for first-order minimization of smooth convex functions. To obtain this result we provide a fine-grained characterization of the convergence rates of extragradient methods for solving monotone variational inequalities in terms of a natural condition we call relative Lipschitzness. We further generalize this framework to handle local and randomized notions of relative Lipschitzness and thereby recover rates for box-constrained $\ell_\infty$ regression based on area convexity and complexity bounds achieved by accelerated (randomized) coordinate descent for smooth convex function minimization.
DSOct 7, 2020
Structured Logconcave Sampling with a Restricted Gaussian OracleYin Tat Lee, Ruoqi Shen, Kevin Tian
We give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to improve dependences on problem conditioning. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $ε$. For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $κ$ and convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(κd \log^3\frac{κd}ε)$, matching the state-of-the-art non-composite bound; no composite samplers with better mixing than general-purpose logconcave samplers were previously known. For logconcave finite sums $\exp(-F(x))$, where $F(x) = \frac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $κ$, we give a sampler querying $\widetilde{O}(n + κ\max(d, \sqrt{nd}))$ gradient oracles to $\{f_i\}_{i \in [n]}$; no high-accuracy samplers with nontrivial gradient query complexity were previously known. For densities with condition number $κ$, we give an algorithm obtaining mixing time $O(κd \log^2\frac{κd}ε)$, improving the prior state-of-the-art by a logarithmic factor with a significantly simpler analysis; we also show a zeroth-order algorithm attains the same query complexity.
DSSep 17, 2020
Coordinate Methods for Matrix GamesYair Carmon, Yujia Jin, Aaron Sidford et al.
We develop primal-dual coordinate methods for solving bilinear saddle-point problems of the form $\min_{x \in \mathcal{X}} \max_{y\in\mathcal{Y}} y^\top A x$ which contain linear programming, classification, and regression as special cases. Our methods push existing fully stochastic sublinear methods and variance-reduced methods towards their limits in terms of per-iteration complexity and sample complexity. We obtain nearly-constant per-iteration complexity by designing efficient data structures leveraging Taylor approximations to the exponential and a binomial heap. We improve sample complexity via low-variance gradient estimators using dynamic sampling distributions that depend on both the iterates and the magnitude of the matrix entries. Our runtime bounds improve upon those of existing primal-dual methods by a factor depending on sparsity measures of the $m$ by $n$ matrix $A$. For example, when rows and columns have constant $\ell_1/\ell_2$ norm ratios, we offer improvements by a factor of $m+n$ in the fully stochastic setting and $\sqrt{m+n}$ in the variance-reduced setting. We apply our methods to computational geometry problems, i.e. minimum enclosing ball, maximum inscribed ball, and linear regression, and obtain improved complexity bounds. For linear regression with an elementwise nonnegative matrix, our guarantees improve on exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/(m+n)}$.
OCAug 4, 2020
Fast and Near-Optimal Diagonal PreconditioningArun Jambulapati, Jerry Li, Christopher Musco et al.
The convergence rates of iterative methods for solving a linear system $\mathbf{A} x = b$ typically depend on the condition number of the matrix $\mathbf{A}$. Preconditioning is a common way of speeding up these methods by reducing that condition number in a computationally inexpensive way. In this paper, we revisit the decades-old problem of how to best improve $\mathbf{A}$'s condition number by left or right diagonal rescaling. We make progress on this problem in several directions. First, we provide new bounds for the classic heuristic of scaling $\mathbf{A}$ by its diagonal values (a.k.a. Jacobi preconditioning). We prove that this approach reduces $\mathbf{A}$'s condition number to within a quadratic factor of the best possible scaling. Second, we give a solver for structured mixed packing and covering semidefinite programs (MPC SDPs) which computes a constant-factor optimal scaling for $\mathbf{A}$ in $\widetilde{O}(\text{nnz}(\mathbf{A}) \cdot \text{poly}(κ^\star))$ time; this matches the cost of solving the linear system after scaling up to a $\widetilde{O}(\text{poly}(κ^\star))$ factor. Third, we demonstrate that a sufficiently general width-independent MPC SDP solver would imply near-optimal runtimes for the scaling problems we consider, and natural variants concerned with measures of average conditioning. Finally, we highlight connections of our preconditioning techniques to semi-random noise models, as well as applications in reducing risk in several statistical regression models.
DSJun 12, 2020
Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten PackingArun Jambulapati, Jerry Li, Kevin Tian
We develop two methods for the following fundamental statistical task: given an $ε$-corrupted set of $n$ samples from a $d$-dimensional sub-Gaussian distribution, return an approximate top eigenvector of the covariance matrix. Our first robust PCA algorithm runs in polynomial time, returns a $1 - O(ε\logε^{-1})$-approximate top eigenvector, and is based on a simple iterative filtering approach. Our second, which attains a slightly worse approximation factor, runs in nearly-linear time and sample complexity under a mild spectral gap assumption. These are the first polynomial-time algorithms yielding non-trivial information about the covariance of a corrupted sub-Gaussian distribution without requiring additional algebraic structure of moments. As a key technical tool, we develop the first width-independent solvers for Schatten-$p$ norm packing semidefinite programs, giving a $(1 + ε)$-approximate solution in $O(p\log(\tfrac{nd}ε)ε^{-1})$ input-sparsity time iterations (where $n$, $d$ are problem dimensions).
LGJun 10, 2020
Composite Logconcave Sampling with a Restricted Gaussian OracleRuoqi Shen, Kevin Tian, Yin Tat Lee
We consider sampling from composite densities on $\mathbb{R}^d$ of the form $dπ(x) \propto \exp(-f(x) - g(x))dx$ for well-conditioned $f$ and convex (but possibly non-smooth) $g$, a family generalizing restrictions to a convex set, through the abstraction of a restricted Gaussian oracle. For $f$ with condition number $κ$, our algorithm runs in $O \left(κ^2 d \log^2\tfrac{κd}ε\right)$ iterations, each querying a gradient of $f$ and a restricted Gaussian oracle, to achieve total variation distance $ε$. The restricted Gaussian oracle, which draws samples from a distribution whose negative log-likelihood sums a quadratic and $g$, has been previously studied and is a natural extension of the proximal oracle used in composite optimization. Our algorithm is conceptually simple and obtains stronger provable guarantees and greater generality than existing methods for composite sampling. We conduct experiments showing our algorithm vastly improves upon the hit-and-run algorithm for sampling the restriction of a (non-diagonal) Gaussian to the positive orthant.
LGFeb 10, 2020
Logsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte CarloYin Tat Lee, Ruoqi Shen, Kevin Tian
We show that the gradient norm $\|\nabla f(x)\|$ for $x \sim \exp(-f(x))$, where $f$ is strongly convex and smooth, concentrates tightly around its mean. This removes a barrier in the prior state-of-the-art analysis for the well-studied Metropolized Hamiltonian Monte Carlo (HMC) algorithm for sampling from a strongly logconcave distribution. We correspondingly demonstrate that Metropolized HMC mixes in $\tilde{O}(κd)$ iterations, improving upon the $\tilde{O}(κ^{1.5}\sqrt{d} + κd)$ runtime of (Dwivedi et. al. '18, Chen et. al. '19) by a factor $(κ/d)^{1/2}$ when the condition number $κ$ is large. Our mixing time analysis introduces several techniques which to our knowledge have not appeared in the literature and may be of independent interest, including restrictions to a nonconvex set with good conductance behavior, and a new reduction technique for boosting a constant-accuracy total variation guarantee under weak warmness assumptions. This is the first high-accuracy mixing time result for logconcave distributions using only first-order function information which achieves linear dependence on $κ$; we also give evidence that this dependence is likely to be necessary for standard Metropolized first-order methods.
OCJul 3, 2019
Variance Reduction for Matrix GamesYair Carmon, Yujia Jin, Aaron Sidford et al.
We present a randomized primal-dual algorithm that solves the problem $\min_{x} \max_{y} y^\top A x$ to additive error $ε$ in time $\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)n}/ε$, for matrix $A$ with larger dimension $n$ and $\mathrm{nnz}(A)$ nonzero entries. This improves the best known exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/n}$ and is faster than fully stochastic gradient methods in the accurate and/or sparse regime $ε\le \sqrt{n/\mathrm{nnz}(A)}$. Our results hold for $x,y$ in the simplex (matrix games, linear programming) and for $x$ in an $\ell_2$ ball and $y$ in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines Nemirovski's "conceptual prox-method" and a novel reduced-variance gradient estimator based on "sampling from the difference" between the current iterate and a reference point.
DSJun 3, 2019
A Direct $\tilde{O}(1/ε)$ Iteration Parallel Algorithm for Optimal TransportArun Jambulapati, Aaron Sidford, Kevin Tian
Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $ε$ with $\tilde{O}(1/ε)$ parallel depth, and $\tilde{O}\left(n^2/ε\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem, this is optimal for first-order methods. Blanchet et. al. '18, Quanrud '19 obtained similar runtimes through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use complicated subroutines which may be deemed impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). The fastest practical algorithms run in time $\tilde{O}(\min(n^2 / ε^2, n^{2.5} / ε))$ (Dvurechensky et. al. '18, Lin et. al. '19). We bridge this gap by providing a parallel, first-order, $\tilde{O}(1/ε)$ iteration algorithm without worse dependence on dimension, and provide preliminary experimental evidence that our algorithm may enjoy improved practical performance. We obtain this runtime via a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow (Sherman '17).
LGMar 7, 2019
A Rank-1 Sketch for Matrix Multiplicative WeightsYair Carmon, John C. Duchi, Aaron Sidford et al.
We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a $\textit{randomized mirror projection}$, and perform mirror descent analysis on the $\textit{expected projection}$. Our sketch solves the online eigenvector problem, improving the best known complexity bounds by $Ω(\log^5 n)$. We also apply this sketch to semidefinite programming in saddle-point form, yielding a simple primal-dual scheme with guarantees matching the best in the literature.
LGSep 8, 2017
Learning Populations of ParametersKevin Tian, Weihao Kong, Gregory Valiant
Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0,1]$, and we observe $n$ independent random variables, $X_1,\ldots,X_n$, with $X_i \sim $ Binomial$(t, p_i)$. How accurately can one recover the "histogram" (i.e. cumulative density function) of the $p_i$'s? While the empirical estimates would recover the histogram to earth mover distance $Θ(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$ which is information theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.