DSNov 11, 2022
Re-Analyze Gauss: Bounds for Private Matrix Approximation via Dyson Brownian MotionOren Mangoubi, Nisheeth K. Vishnoi
Given a symmetric matrix $M$ and a vector $λ$, we present new bounds on the Frobenius-distance utility of the Gaussian mechanism for approximating $M$ by a matrix whose spectrum is $λ$, under $(\varepsilon,δ)$-differential privacy. Our bounds depend on both $λ$ and the gaps in the eigenvalues of $M$, and hold whenever the top $k+1$ eigenvalues of $M$ have sufficiently large gaps. When applied to the problems of private rank-$k$ covariance matrix approximation and subspace recovery, our bounds yield improvements over previous bounds. Our bounds are obtained by viewing the addition of Gaussian noise as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations allow us to bound the utility as the square-root of a sum-of-squares of perturbations to the eigenvectors, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems.
DSJul 6, 2022
Private Matrix Approximation and Geometry of Unitary OrbitsOren Mangoubi, Yikai Wu, Satyen Kale et al.
Consider the following optimization problem: Given $n \times n$ matrices $A$ and $Λ$, maximize $\langle A, UΛU^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $Λ$ and, by setting $Λ$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
DSJun 29, 2023
Private Covariance Approximation and Eigenvalue-Gap Bounds for Complex Gaussian PerturbationsOren Mangoubi, Nisheeth K. Vishnoi
We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,δ)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and show that the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$ is bounded by roughly $\tilde{O}(\sqrt{kd})$, whenever there is an appropriately large gap between the $k$'th and the $k+1$'th eigenvalues of $M$. This improves on previous work that requires that the gap between every pair of top-$k$ eigenvalues of $M$ is at least $\sqrt{d}$ for a similar bound. Our analysis leverages the fact that the eigenvalues of complex matrix Brownian motion repel more than in the real case, and uses Dyson's stochastic differential equations governing the evolution of its eigenvalues to show that the eigenvalues of the matrix $M$ perturbed by complex Gaussian noise have large gaps with high probability. Our results contribute to the analysis of low-rank approximations under average-case perturbations and to an understanding of eigenvalue gaps for random matrices, which may be of independent interest.
DSJun 19, 2022
Sampling from Log-Concave Distributions over Polytopes via a Soft-Threshold Dikin WalkOren Mangoubi, Nisheeth K. Vishnoi
Given a Lipschitz or smooth convex function $\, f:K \to \mathbb{R}$ for a bounded polytope $K \subseteq \mathbb{R}^d$ defined by $m$ inequalities, we consider the problem of sampling from the log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to $K$. Interest in this problem derives from its applications to Bayesian inference and differentially private learning. Our main result is a generalization of the Dikin walk Markov chain to this setting that requires at most $O((md + d L^2 R^2) \times md^{ω-1}) \log(\frac{w}δ))$ arithmetic operations to sample from $π$ within error $δ>0$ in the total variation distance from a $w$-warm start. Here $L$ is the Lipschitz-constant of $f$, $K$ is contained in a ball of radius $R$ and contains a ball of smaller radius $r$, and $ω$ is the matrix-multiplication constant. Our algorithm improves on the running time of prior works for a range of parameter settings important for the aforementioned learning applications. Technically, we depart from previous Dikin walks by adding a "soft-threshold" regularizer derived from the Lipschitz or smoothness properties of $f$ to the log-barrier function for $K$ that allows our version of the Dikin walk to propose updates that have a high Metropolis acceptance ratio for $f$, while at the same time remaining inside the polytope $K$.
DSSep 6, 2024
Faster Sampling from Log-Concave Densities over Polytopes via Efficient Linear SolversOren Mangoubi, Nisheeth K. Vishnoi
We consider the problem of sampling from a log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to a polytope $K:=\{θ\in \mathbb{R}^d: Aθ\leq b\}$, where $A\in \mathbb{R}^{m\times d}$ and $b \in \mathbb{R}^m$.The fastest-known algorithm \cite{mangoubi2022faster} for the setting when $f$ is $O(1)$-Lipschitz or $O(1)$-smooth runs in roughly $O(md \times md^{ω-1})$ arithmetic operations, where the $md^{ω-1}$ term arises because each Markov chain step requires computing a matrix inversion and determinant (here $ω\approx 2.37$ is the matrix multiplication constant). We present a nearly-optimal implementation of this Markov chain with per-step complexity which is roughly the number of non-zero entries of $A$ while the number of Markov chain steps remains the same. The key technical ingredients are 1) to show that the matrices that arise in this Dikin walk change slowly, 2) to deploy efficient linear solvers that can leverage this slow change to speed up matrix inversion by using information computed in previous steps, and 3) to speed up the computation of the determinantal term in the Metropolis filter step via a randomized Taylor series-based estimator.
DSFeb 11, 2025
Private Low-Rank Approximation for Covariance Matrices, Dyson Brownian Motion, and Eigenvalue-Gap Bounds for Gaussian PerturbationsOren Mangoubi, Nisheeth K. Vishnoi
We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,δ)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and obtain upper bounds on the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$. Our analysis provides improvements over previous bounds, particularly when the spectrum of $M$ satisfies natural structural assumptions. The novel insight is to view the addition of Gaussian noise to a matrix as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations enable us to upper bound the Frobenius distance between the best rank-$k$ approximation of $M$ and that of a Gaussian perturbation of $M$ as an integral that involves inverse eigenvalue gaps of the stochastically evolving matrix, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems. Subsequently, again using the Dyson Brownian motion viewpoint, we show that the eigenvalues of the matrix $M$ perturbed by Gaussian noise have large gaps with high probability. These results also contribute to the analysis of low-rank approximations under average-case perturbations, and to an understanding of eigenvalue gaps for random matrices, both of which may be of independent interest.
LGMay 27, 2025
Efficient Diffusion Models for Symmetric ManifoldsOren Mangoubi, Neil He, Nisheeth K. Vishnoi
We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Ito's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) arithmetic operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
DSNov 7, 2021
Sampling from Log-Concave Distributions with Infinity-Distance GuaranteesOren Mangoubi, Nisheeth K. Vishnoi
For a $d$-dimensional log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $ν$ which is $\varepsilon$-close in infinity-distance $\sup_{θ\in K} |\log \frac{ν(θ)}{π(θ)}|$ to $π$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $π$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $π$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and present a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
DCApr 16, 2021
Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep LearningShijian Li, Oren Mangoubi, Lijie Xu et al.
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters. A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol. For example, while Bulk Synchronous Parallel (BSP) often achieves better converged accuracy, the corresponding training throughput can be negatively impacted by stragglers. In contrast, Asynchronous Parallel (ASP) can have higher throughput, but its convergence and accuracy can be impacted by stale gradients. To improve the performance of synchronization protocol, recent work often focuses on designing new protocols with a heavy reliance on hard-to-tune hyper-parameters. In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP, i.e., reducing training time while simultaneously maintaining the converged accuracy. Based on extensive empirical profiling, we devise a collection of adaptive policies that determine how and when to switch between synchronization protocols. Our policies include both offline ones that target recurring jobs and online ones for handling transient stragglers. We implement the proposed policies in a prototype system, called Sync-Switch, on top of TensorFlow, and evaluate the training performance with popular deep learning models and datasets. Our experiments show that Sync-Switch achieves up to 5.13X throughput speedup and similar converged accuracy when comparing to BSP. Further, we observe that Sync-Switch achieves 3.8% higher converged accuracy with just 1.23X the training time compared to training with ASP. Moreover, Sync-Switch can be used in settings when training with ASP leads to divergence errors. Sync-Switch achieves all of these benefits with very low overhead, e.g., the framework overhead can be as low as 1.7% of the total training time.
LGJun 22, 2020
A Convergent and Dimension-Independent Min-Max Optimization AlgorithmVijay Keswani, Oren Mangoubi, Sushant Sachdeva et al.
We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test-functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse
DSJun 22, 2020
Greedy Adversarial Equilibrium: An Efficient Alternative to Nonconvex-Nonconcave Min-Max OptimizationOren Mangoubi, Nisheeth K. Vishnoi
Min-max optimization of an objective function $f: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is an important model for robustness in an adversarial setting, with applications to many areas including optimization, economics, and deep learning. In many applications $f$ may be nonconvex-nonconcave, and finding a global min-max point may be computationally intractable. There is a long line of work that seeks computationally tractable algorithms for alternatives to the min-max optimization model. However, many of the alternative models have solution points which are only guaranteed to exist under strong assumptions on $f$, such as convexity, monotonicity, or special properties of the starting point. We propose an optimization model, the $\varepsilon$-greedy adversarial equilibrium, and show that it can serve as a computationally tractable alternative to the min-max optimization model. Roughly, we say that a point $(x^\star, y^\star)$ is an $\varepsilon$-greedy adversarial equilibrium if $y^\star$ is an $\varepsilon$-approximate local maximum for $f(x^\star,\cdot)$, and $x^\star$ is an $\varepsilon$-approximate local minimum for a "greedy approximation" to the function $\max_z f(x, z)$ which can be efficiently estimated using second-order optimization algorithms. We prove the existence of such a point for any smooth function which is bounded and has Lipschitz Hessian. To prove existence, we introduce an algorithm that converges from any starting point to an $\varepsilon$-greedy adversarial equilibrium in a number of evaluations of the function $f$, the max-player's gradient $\nabla_y f(x,y)$, and its Hessian $\nabla^2_y f(x,y)$, that is polynomial in the dimension $d$, $1/\varepsilon$, and the bounds on $f$ and its Lipschitz constant.
DSMay 5, 2019
Faster polytope rounding, sampling, and volume computation via a sublinear "Ball Walk"Oren Mangoubi, Nisheeth K. Vishnoi
We study the problem of "isotropically rounding" a polytope $K\subset\mathbb{R}^n$, that is, computing a linear transformation which makes the uniform distribution on the polytope have roughly identity covariance matrix. We assume $K$ is defined by $m$ linear inequalities, with guarantee that $rB\subset K\subset RB$, where $B$ is the unit ball. We introduce a new variant of the ball walk Markov chain and show that, roughly, the expected number of arithmetic operations per-step of this Markov chain is $O(m)$ that is sublinear in the input size $mn$--the per-step time of all prior Markov chains. Subsequently, we give a rounding algorithm that succeeds with probability $1-\varepsilon$ in $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ arithmetic operations. This gives a factor of $\sqrt{n}$ improvement on the previous bound of $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for rounding, which uses the hit-and-run algorithm. Since the rounding preprocessing step is in many cases the bottleneck in improving sampling or volume computation, our results imply these tasks can also be achieved in roughly $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4δ^{-2})$ operations for computing the volume of $K$ up to a factor $1+δ$ and $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})))$ for uniformly sampling on $K$ with TV error $\varepsilon$. This improves on the previous bounds of $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4δ^{-2})$ for volume computation when roughly $m\geq n^{2.5}$, and $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for sampling when roughly $m\geq n^{1.5}$. We achieve this improvement by a novel method of computing polytope membership, where one avoids checking inequalities estimated to have a very low probability of being violated.
DSFeb 22, 2019
Nonconvex sampling with the Metropolis-adjusted Langevin algorithmOren Mangoubi, Nisheeth K. Vishnoi
The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex statistics and machine learning applications. Despite this, current bounds for the Langevin algorithms are slower than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampling or optimizing non-convex log-densities. In this paper, we obtain improved bounds in many of these situations, showing that the Metropolis-adjusted Langevin algorithm (MALA) is faster than the best bounds for its competitor algorithms when the target distribution satisfies weak third- and fourth- order regularity properties associated with the input data. In many settings, our regularity conditions are weaker than the usual Euclidean operator norm regularity properties, allowing us to show faster bounds for a much larger class of distributions than would be possible with the usual Euclidean operator norm approach, including in statistics and machine learning applications where the data satisfy a certain incoherence condition. In particular, we show that using our regularity conditions one can obtain faster bounds for applications which include sampling problems in Bayesian logistic regression with weakly convex priors, and the nonconvex optimization problem of learning linear classifiers with zero-one loss functions. Our main technical contribution in this paper is our analysis of the Metropolis acceptance probability of MALA in terms of its "energy-conservation error," and our bound for this error in terms of third- and fourth- order regularity conditions. Our combination of this higher-order analysis of the energy conservation error with the conductance method is key to obtaining bounds which have a sub-linear dependence on the dimension $d$ in the non-strongly logconcave setting.
LGFeb 21, 2019
Online Sampling from Log-Concave DistributionsHolden Lee, Oren Mangoubi, Nisheeth K. Vishnoi
Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $π_t \propto e^{-\sum_{k=0}^tf_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $π_t$ for every epoch $t$ and, under mild assumptions, makes $\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that functions are smooth, their associated distributions have a bounded second moment, and their minimizer drifts in a bounded manner, but do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression, when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\varepsilon$ of $π_t$.
PRAug 9, 2018
Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities?Oren Mangoubi, Natesh S. Pillai, Aaron Smith
Hamiltonian Monte Carlo (HMC) is a very popular and generic collection of Markov chain Monte Carlo (MCMC) algorithms. One explanation for the popularity of HMC algorithms is their excellent performance as the dimension $d$ of the target becomes large: under conditions that are satisfied for many common statistical models, optimally-tuned HMC algorithms have a running time that scales like $d^{0.25}$. In stark contrast, the running time of the usual Random-Walk Metropolis (RWM) algorithm, optimally tuned, scales like $d$. This superior scaling of the HMC algorithm with dimension is attributed to the fact that it, unlike RWM, incorporates the gradient information in the proposal distribution. In this paper, we investigate a different scaling question: does HMC beat RWM for highly $\textit{multimodal}$ targets? We find that the answer is often $\textit{no}$. We compute the spectral gaps for both the algorithms for a specific class of multimodal target densities, and show that they are identical. The key reason is that, within one mode, the gradient is effectively ignorant about other modes, thus negating the advantage the HMC algorithm enjoys in unimodal targets. We also give heuristic arguments suggesting that the above observation may hold quite generally. Our main tool for answering this question is a novel simple formula for the conductance of HMC using Liouville's theorem. This result allows us to compute the spectral gap of HMC algorithms, for both the classical HMC with isotropic momentum and the recent Riemannian HMC, for multimodal targets.
DSFeb 24, 2018
Dimensionally Tight Bounds for Second-Order Hamiltonian Monte CarloOren Mangoubi, Nisheeth K. Vishnoi
Hamiltonian Monte Carlo (HMC) is a widely deployed method to sample from high-dimensional distributions in Statistics and Machine learning. HMC is known to run very efficiently in practice and its popular second-order "leapfrog" implementation has long been conjectured to run in $d^{1/4}$ gradient evaluations. Here we show that this conjecture is true when sampling from strongly log-concave target distributions that satisfy a weak third-order regularity property associated with the input data. Our regularity condition is weaker than the Lipschitz Hessian property and allows us to show faster convergence bounds for a much larger class of distributions than would be possible with the usual Lipschitz Hessian constant alone. Important distributions that satisfy our regularity condition include posterior distributions used in Bayesian logistic regression for which the data satisfies an "incoherence" property. Our result compares favorably with the best available bounds for the class of strongly log-concave distributions, which grow like $d^{{1}/{2}}$ gradient evaluations with the dimension. Moreover, our simulations on synthetic data suggest that, when our regularity condition is satisfied, leapfrog HMC performs better than its competitors -- both in terms of accuracy and in terms of the number of gradient evaluations it requires.
DSNov 7, 2017
Convex Optimization with Unbounded Nonconvex Oracles using Simulated AnnealingOren Mangoubi, Nisheeth K. Vishnoi
We consider the problem of minimizing a convex objective function $F$ when one can only evaluate its noisy approximation $\hat{F}$. Unless one assumes some structure on the noise, $\hat{F}$ may be an arbitrary nonconvex function, making the task of minimizing $F$ intractable. To overcome this, prior work has often focused on the case when $F(x)-\hat{F}(x)$ is uniformly-bounded. In this paper we study the more general case when the noise has magnitude $αF(x) + β$ for some $α, β> 0$, and present a polynomial time algorithm that finds an approximate minimizer of $F$ for this noise model. Previously, Markov chains, such as the stochastic gradient Langevin dynamics, have been used to arrive at approximate solutions to these optimization problems. However, for the noise model considered in this paper, no single temperature allows such a Markov chain to both mix quickly and concentrate near the global minimizer. We bypass this by combining "simulated annealing" with the stochastic gradient Langevin dynamics, and gradually decreasing the temperature of the chain in order to approach the global minimizer. As a corollary one can approximately minimize a nonconvex function that is close to a convex function; however, the closeness can deteriorate as one moves away from the optimum.