DS Jul 6, 2023
A Near-Linear Time Algorithm for the Chamfer Distance
Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram et al.
For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, and graphics applications, and admits a straightforward $O(d n^2)$-time brute force algorithm. Further, the Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the \emph{quadratic} dependence on $n$ in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first $(1+ε)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time $O(nd \log (n)/\varepsilon^2)$ and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large high-dimensional point clouds. We also give evidence that if the goal is to \emph{report} a $(1+\varepsilon)$-approximate mapping from $A$ to $B$ (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.
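To make the definition concrete, the following is a minimal sketch of the straightforward $O(d n^2)$ brute-force baseline mentioned above, not the near-linear algorithm from the paper. It assumes the Euclidean distance for $d_X$; the function name `chamfer_distance` and the sample points are illustrative choices.

```python
import math


def chamfer_distance(A, B):
    """Brute-force CH(A, B): for each a in A, add the distance to its nearest b in B.

    Assumes Euclidean distance; runs in O(d * |A| * |B|) time.
    """
    if not B:
        raise ValueError("B must be non-empty")
    return sum(min(math.dist(a, b) for b in B) for a in A)


# Illustrative point sets in the plane (not taken from the paper).
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
B = [(0.0, 1.0), (4.0, 0.0)]

print(chamfer_distance(A, B))  # distance from A to B
print(chamfer_distance(B, A))  # generally different: the definition is not symmetric
```

The second call illustrates that $\text{CH}(A,B)$ and $\text{CH}(B,A)$ need not agree, since the sum ranges over the first argument only.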
QUANT-PH Oct 3, 2023
Learning quantum Hamiltonians at any temperature in polynomial time
Ainesh Bakshi, Allen Liu, Ankur Moitra et al.
We study the problem of learning a local quantum Hamiltonian $H$ given copies of its Gibbs state $ρ= e^{-βH}/\textrm{tr}(e^{-βH})$ at a known inverse temperature $β>0$. Anshu, Arunachalam, Kuwahara, and Soleimanifar (arXiv:2004.07266) gave an algorithm to learn a Hamiltonian on $n$ qubits to precision $ε$ with only polynomially many copies of the Gibbs state, but which takes exponential time. Obtaining a computationally efficient algorithm has been a major open problem [Alhambra'22 (arXiv:2204.08349)], [Anshu, Arunachalam'22 (arXiv:2204.08349)], with prior work only resolving this in the limited cases of high temperature [Haah, Kothari, Tang'21 (arXiv:2108.04842)] or commuting terms [Anshu, Arunachalam, Kuwahara, Soleimanifar'21]. We fully resolve this problem, giving a polynomial time algorithm for learning $H$ to precision $ε$ from polynomially many copies of the Gibbs state at any constant $β> 0$. Our main technical contribution is a new flat polynomial approximation to the exponential function, and a translation between multi-variate scalar polynomials and nested commutators. This enables us to formulate Hamiltonian learning as a polynomial system. We then show that solving a low-degree sum-of-squares relaxation of this polynomial system suffices to accurately learn the Hamiltonian.
OC Jan 23, 2023
A New Approach to Learning Linear Dynamical Systems
Ainesh Bakshi, Allen Liu, Ankur Moitra et al.
Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolf Kalman's pioneering work in the 1960s. To this end, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.
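As a small illustration of the objects involved, the sketch below computes the Markov parameters $CB, CAB, CA^2B, \dots$ of a known system. It assumes the standard state-space form $x_{t+1} = A x_t + B u_t$, $y_t = C x_t$, which the abstract does not spell out, and it is not the method-of-moments estimator from the paper: here the matrices are given rather than learned. The helper names and the example matrices are hypothetical.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def markov_parameters(A, B, C, count):
    """Return the list [C B, C A B, ..., C A^(count-1) B]."""
    params = []
    AkB = B  # holds A^k B for the current k
    for _ in range(count):
        params.append(matmul(C, AkB))
        AkB = matmul(A, AkB)
    return params


# Illustrative marginally stable system (eigenvalues 0.5 and 1.0).
A = [[0.5, 0.0], [0.0, 1.0]]
B = [[1.0], [1.0]]
C = [[1.0, 1.0]]

print(markov_parameters(A, B, C, 3))
```

Under the assumed model, these matrices are exactly the system's response to a unit impulse at time zero, which is why estimating them from a trajectory gives a handle on the dynamics.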
LG Dec 1, 2022
Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation
Ainesh Bakshi, Piotr Indyk, Praneeth Kacham et al.
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $Ω(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
LG Jul 13, 2023
Tensor Decompositions Meet Control Theory: Learning General Mixtures of Linear Dynamical Systems
Ainesh Bakshi, Allen Liu, Ankur Moitra et al.
Recently, Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover, our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a close relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.
DS Apr 6, 2023
Krylov Methods are (nearly) Optimal for Low-Rank Approximation
Ainesh Bakshi, Shyam Narayanan
We consider the problem of rank-$1$ low-rank approximation (LRA) in the matrix-vector product model under various Schatten norms: $$ \min_{\|u\|_2=1} \|A (I - u u^\top)\|_{\mathcal{S}_p} , $$ where $\|M\|_{\mathcal{S}_p}$ denotes the $\ell_p$ norm of the singular values of $M$. Given $\varepsilon>0$, our goal is to output a unit vector $v$ such that $$ \|A(I - vv^\top)\|_{\mathcal{S}_p} \leq (1+\varepsilon) \min_{\|u\|_2=1}\|A(I - u u^\top)\|_{\mathcal{S}_p}. $$ Our main result shows that Krylov methods (nearly) achieve the information-theoretically optimal number of matrix-vector products for Spectral ($p=\infty$), Frobenius ($p=2$) and Nuclear ($p=1$) LRA. In particular, for Spectral LRA, we show that any algorithm requires $Ω\left(\log(n)/\varepsilon^{1/2}\right)$ matrix-vector products, exactly matching the upper bound obtained by Krylov methods [MM15, BCW22]. Our lower bound addresses Open Question 1 in [Woo14], providing evidence for the lack of progress on algorithms for Spectral LRA, and resolves Open Question 1.2 in [BCW22]. Next, we show that for any fixed constant $p$, i.e. $1\leq p =O(1)$, there is an upper bound of $O\left(\log(1/\varepsilon)/\varepsilon^{1/3}\right)$ matrix-vector products, implying that the complexity does not grow as a function of input size. This improves the $O\left(\log(n/\varepsilon)/\varepsilon^{1/3}\right)$ bound recently obtained in [BCW22], and matches their $Ω\left(1/\varepsilon^{1/3}\right)$ lower bound, up to a $\log(1/\varepsilon)$ factor.
DS Nov 29, 2023
Metric Embeddings Beyond Bi-Lipschitz Distortion via Sherali-Adams
Ainesh Bakshi, Vincent Cohen-Addad, Samuel B. Hopkins et al.
Metric embeddings are a widely used method in algorithm design, where generally a "complex" metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that are commonly used in practice avoid bi-Lipschitz distortion; yet these approaches have received comparatively less study in theory. In this paper, we focus on Multi-dimensional Scaling (MDS), where we are given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j\in [n]}$ over $n$ points, and the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes $$\textrm{OPT}=\min_{x}\mathbb{E}_{i,j\in [n]}\left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2.$$ Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine et al. (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant dimensional Euclidean space with cost $\textrm{OPT} +ε$ in $n^2\cdot 2^{\textrm{poly}(Δ/ε)}$ time, where $Δ$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, $Δ$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $Δ$: for constant dimensional Euclidean space, we achieve a solution with cost $O(\log Δ)\cdot \textrm{OPT}^{Ω(1)}+ε$ in time $n^{O(1)} \cdot 2^{\textrm{poly}(\log(Δ)/ε)}$. Our algorithms are based on a novel geometry-aware analysis of a conditional rounding of the Sherali-Adams LP Hierarchy, allowing us to avoid exponential dependency on the aspect ratio, which would typically result from this rounding.
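The MDS objective above can be evaluated directly for any candidate embedding. The sketch below only scores a given embedding and does not implement the Sherali-Adams rounding. It assumes the expectation is taken uniformly over unordered pairs $i < j$ with $d_{i,j} > 0$ and that $\|\cdot\|$ is the Euclidean norm; the function name `mds_cost` and the toy inputs are illustrative.

```python
import math
from itertools import combinations


def mds_cost(X, d):
    """Average of (1 - ||x_i - x_j|| / d[i][j])^2 over unordered pairs i < j.

    X is a list of points (tuples of equal length); d is a symmetric matrix of
    positive dissimilarities off the diagonal.
    """
    pairs = list(combinations(range(len(X)), 2))
    total = sum((1.0 - math.dist(X[i], X[j]) / d[i][j]) ** 2 for i, j in pairs)
    return total / len(pairs)


# Dissimilarities of three equally spaced points on a line.
d = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]

X_exact = [(0.0,), (1.0,), (2.0,)]      # reproduces every dissimilarity
X_collapsed = [(0.0,), (1.0,), (1.0,)]  # merges the last two points

print(mds_cost(X_exact, d))
print(mds_cost(X_collapsed, d))
```

Because each term is a relative error, collapsing two points that should be at distance 1 costs as much as any other total loss of a pair, regardless of scale.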
QUANT-PH May 14
Sharp Bounds on the Eigenvalues of Kikuchi Graphs and Applications to Quantum Max Cut
Ainesh Bakshi, Arpon Basu, Pravesh Kothari et al.
We prove that the maximum eigenvalue of the (both signed and unsigned) Laplacian of the level-$k$ Kikuchi graph of any graph $G$ with $m$ edges is at most $m+k$. This confirms four recent conjectures of Apte, Parekh, and Sud. As applications, we obtain that tensor products of one and two qubit product states achieve an approximation ratio of $5/8$ for Quantum Max Cut and $5/7$ for the XY Hamiltonian. Moreover, combining our bounds with the algorithms analyzed by Apte, Parekh, and Sud yields efficient algorithms achieving an approximation ratio of $0.614$ for Quantum Max Cut and $0.674$ for the XY Hamiltonian. Finally, we also make modest progress on Brouwer's conjecture and improve Lew's bound on the sum of the top-$k$ eigenvalues of a graph Laplacian.
QUANT-PH Apr 30, 2024
Structure learning of Hamiltonians from real-time evolution
Ainesh Bakshi, Allen Liu, Ankur Moitra et al.
We study the problem of Hamiltonian structure learning from real-time evolution: given the ability to apply $e^{-\mathrm{i} Ht}$ for an unknown local Hamiltonian $H = \sum_{a = 1}^m λ_a E_a$ on $n$ qubits, the goal is to recover $H$. This problem is already well-understood under the assumption that the interaction terms, $E_a$, are given, and only the interaction strengths, $λ_a$, are unknown. But how efficiently can we learn a local Hamiltonian without prior knowledge of its interaction structure? We present a new, general approach to Hamiltonian learning that not only solves the challenging structure learning variant, but also resolves other open questions in the area, all while achieving the gold standard of Heisenberg-limited scaling. In particular, our algorithm recovers the Hamiltonian to $\varepsilon$ error with total evolution time $O(\log (n)/\varepsilon)$, and has the following appealing properties: (1) it does not need to know the Hamiltonian terms; (2) it works beyond the short-range setting, extending to any Hamiltonian $H$ where the sum of terms interacting with a qubit has bounded norm; (3) it evolves according to $H$ in constant time $t$ increments, thus achieving constant time resolution. As an application, we can also learn Hamiltonians exhibiting power-law decay up to accuracy $\varepsilon$ with total evolution time beating the standard limit of $1/\varepsilon^2$.
DS Apr 24
Entrywise Low-Rank Approximation and Matrix $p \rightarrow q$ Norms via Global Correlation Rounding
Prashanti Anderson, Ainesh Bakshi, Samuel Hopkins
Given a matrix $A$, the goal of the entrywise low-rank approximation problem is to find $\operatorname{argmin} \|A-B\|_p$ over all rank-$k$ matrices $B$, where $\| \cdot \|_p$ is the entrywise $\ell_p$ norm. When $p = 2$ this well-studied problem is solved by the singular value decomposition, but for $p \neq 2$ the problem becomes computationally challenging. For every even $p > 2$ and every fixed $k$, we give the first polynomial-time approximation scheme for this problem, improving on the $(3 + \varepsilon)$ approximation of Ban, Bhattiprolu, Bringmann, Kolev, Lee, and Woodruff, the bi-criteria approximation of Woodruff and Yasuda, and the additive approximation scheme of Anderson, Bakshi, and Hopkins. Prior algorithmic approaches based on sketching and column selection, which yielded a polynomial-time approximation scheme in the $p < 2$ setting, face concrete barriers when $p > 2$. Instead, we use the Sherali-Adams hierarchy of convex programs, and in so doing establish a blueprint for how to use convex hierarchies to design polynomial-time approximation schemes for continuous optimization problems. We use the same algorithmic strategy to give a new family of additive approximation algorithms for matrix $p \rightarrow q$ norms, which are intimately related to small-set expansion and quantum information. In particular, we give the first nontrivial additive approximation algorithms in the regime $p < 2 < q$.
DS May 23, 2024
Efficient Certificates of Anti-Concentration Beyond Gaussians
Ainesh Bakshi, Pravesh Kothari, Goutham Rajendran et al.
A set of high dimensional points $X=\{x_1, x_2,\ldots, x_n\} \subset \mathbb{R}^d$ in isotropic position is said to be $δ$-anti-concentrated if for every direction $v$, the fraction of points in $X$ satisfying $|\langle x_i,v \rangle |\leq δ$ is at most $O(δ)$. Motivated by applications to list-decodable learning and clustering, recent works have considered the problem of constructing efficient certificates of anti-concentration in the average case, when the set of points $X$ corresponds to samples from a Gaussian distribution. Their certificates played a crucial role in several subsequent works in algorithmic robust statistics on list-decodable learning and settling the robust learnability of arbitrary Gaussian mixtures, yet remain limited to rotationally invariant distributions. This work presents a new (and arguably the most natural) formulation for anti-concentration. Using this formulation, we give quasi-polynomial time verifiable sum-of-squares certificates of anti-concentration that hold for a wide class of non-Gaussian distributions including anti-concentrated bounded product distributions and uniform distributions over $L_p$ balls (and their affine transformations). Consequently, our method upgrades and extends results in algorithmic robust statistics, e.g., list-decodable learning and clustering, to such distributions. Our approach constructs a canonical integer program for anti-concentration and analyzes a sum-of-squares relaxation of it, independent of the intended application. We rely on duality and analyze a pseudo-expectation on large subsets of the input points that take a small value in some direction. Our analysis uses the method of polynomial reweightings to reduce the problem to analyzing only analytically dense or sparse directions.
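The quantity in the definition is easy to compute for one fixed direction, as the sketch below shows. It is not a certificate: anti-concentration quantifies over every direction $v$, which is what makes certification hard, whereas this code only checks the directions it is handed. The function name `slab_fraction` and the four-point example are illustrative.

```python
import math


def slab_fraction(X, v, delta):
    """Fraction of points x in X with |<x, v>| <= delta, for one unit direction v."""
    inside = sum(
        1 for x in X
        if abs(sum(xi * vi for xi, vi in zip(x, v))) <= delta
    )
    return inside / len(X)


# Four points on the coordinate axes, scaled so the empirical covariance is the identity.
s = math.sqrt(2.0)
X = [(s, 0.0), (-s, 0.0), (0.0, s), (0.0, -s)]

axis = (1.0, 0.0)
diagonal = (1.0 / s, 1.0 / s)

print(slab_fraction(X, axis, 0.1))      # half the points are orthogonal to this direction
print(slab_fraction(X, diagonal, 0.1))  # no point is near-orthogonal to this one
```

The example shows why a single direction is not enough: the same isotropic point set looks well spread along the diagonal but places half its mass in a thin slab around the axis direction.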
QUANT-PH Apr 9
Rapid mixing for high-temperature Gibbs states with arbitrary external fields
Ainesh Bakshi, Xinyu Tan
Gibbs states are a natural model of quantum matter at thermal equilibrium. We investigate the role of external fields in shaping the entanglement structure and computational complexity of high-temperature Gibbs states. External fields can induce entanglement in states that are otherwise provably separable, and the crossover scale is $h\asymp β^{-1} \log(1/β)$, where $h$ is an upper bound on any on-site potential and $β$ is the inverse temperature. We introduce a quasi-local Lindbladian that satisfies detailed balance and rapidly mixes to the Gibbs state in $\mathcal{O}(\log(n/ε))$ time, even in the presence of an arbitrary on-site external field. Additionally, we prove that for any $β<1$, there exist local Hamiltonians for which sampling from the computational-basis distribution of the corresponding Gibbs state with a sufficiently large external field is classically hard, under standard complexity-theoretic assumptions. Therefore, high-temperature Gibbs states with external fields are natural physical models that can exhibit entanglement and classical hardness while also admitting efficient quantum Gibbs samplers, making them suitable candidates for quantum advantage via state preparation.
DS Mar 31, 2025
Sample-Optimal Private Regression in Polynomial Time
Prashanti Anderson, Ainesh Bakshi, Mahbod Majid et al.
We consider the task of privately obtaining prediction error guarantees in ordinary least-squares regression problems with Gaussian covariates (with unknown covariance structure). We provide the first sample-optimal polynomial time algorithm for this task under both pure and approximate differential privacy. We show that any improvement to the sample complexity of our algorithm would violate either statistical-query or information-theoretic lower bounds. Additionally, our algorithm is robust to a small fraction of arbitrary outliers and achieves optimal error rates as a function of the fraction of outliers. In contrast, all prior efficient algorithms either incurred sample complexities with sub-optimal dimension dependence, scaling with the condition number of the covariates, or obtained a polynomially worse dependence on the privacy parameters. Our technical contributions are two-fold: first, we leverage resilience guarantees of Gaussians within the sum-of-squares framework. As a consequence, we obtain efficient sum-of-squares algorithms for regression with optimal robustness rates and sample complexity. Second, we generalize the recent robustness-to-privacy framework [HKMN23, (arXiv:2212.05015)] to account for the geometry induced by the covariance of the input samples. This framework crucially relies on the robust estimators to be sum-of-squares algorithms, and combining the two steps yields a sample-optimal private regression algorithm. We believe our techniques are of independent interest, and we demonstrate this by obtaining an efficient algorithm for covariance-aware mean estimation, with an optimal dependence on the privacy parameters.
DS Feb 10, 2022
Low-Rank Approximation with $1/ε^{1/3}$ Matrix-Vector Products
Ainesh Bakshi, Kenneth L. Clarkson, David P. Woodruff
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $ε$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+ε)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{ε})$ matrix-vector products, improving on the naïve $\tilde{O}(k/ε)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/ε))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/ε^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/ε^{1/2})$ bound to $\tilde{O}(k/ε^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $(1+ ε)$-factor when $p \geq (\log d)/ε$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $Ω(1/ε^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tilde{Θ}(1/ε^{1/3})$ is the optimal complexity for constant $k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.
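For orientation, here is a minimal sketch of the naïve power method baseline in the matrix-vector product model, specialized to rank $k=1$ and the Frobenius norm ($p=2$). It is not the Krylov algorithm of the paper. The matrix is touched only through products with $A$ and $A^\top$; the deterministic start vector is a simplifying assumption made for reproducibility (a random start is customary and avoids starting orthogonal to the top singular vector), and all function names are illustrative.

```python
import math


def matvec(M, x):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def power_method(A, iters=50):
    """Approximate the top right singular vector of A with 2 * iters matrix-vector products."""
    At = [list(col) for col in zip(*A)]
    dim = len(A[0])
    v = [1.0 / math.sqrt(dim)] * dim  # assumption: fixed start instead of a random one
    for _ in range(iters):
        w = matvec(At, matvec(A, v))  # one product with A, one with A^T
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def residual_frobenius(A, v):
    """||A (I - v v^T)||_F for a unit vector v, via ||A||_F^2 - ||A v||_2^2."""
    fro_sq = sum(c * c for row in A for c in row)
    av_sq = sum(c * c for c in matvec(A, v))
    return math.sqrt(max(fro_sq - av_sq, 0.0))


A = [[3.0, 0.0], [0.0, 1.0]]  # singular values 3 and 1
v = power_method(A)
print(v, residual_frobenius(A, v))
```

On this example the iterate aligns with the first coordinate direction and the residual approaches the second singular value, which is the best achievable rank-1 error.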
LG May 17, 2021
Learning a Latent Simplex in Input-Sparsity Time
Ainesh Bakshi, Chiranjib Bhattacharyya, Ravi Kannan et al.
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$, which can be viewed as a data matrix with $n$ points that are obtained by randomly perturbing latent points in the simplex $K$ (potentially beyond $K$). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA, 2020) give an algorithm for learning such a latent simplex in time roughly $O(k\cdot\textrm{nnz}(A))$, where $\textrm{nnz}(A)$ is the number of non-zeros in $A$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes $k$ matrix-vector product queries to $A$ and each query is a function of all queries preceding it. Since each matrix-vector product requires $\textrm{nnz}(A)$ time, their overall running time appears unavoidable. Instead, we obtain a low-rank approximation to $A$ in input-sparsity time and show that the column space thus obtained has small $\sin Θ$ (angular) distance to the right top-$k$ singular space of $A$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $Θ(k\cdot\textrm{nnz}(A))$ running time.
DS Dec 3, 2020
Robustly Learning Mixtures of $k$ Arbitrary Gaussians
Ainesh Bakshi, Ilias Diakonikolas, He Jia et al.
We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\mathbb{R}^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mixture of TV-distance separated Gaussians, and (c) a uniform mixture of two Gaussians. Our main tools are an efficient \emph{partial clustering} algorithm that relies on the sum-of-squares method, and a novel \emph{tensor decomposition} algorithm that allows errors in both Frobenius norm and low-rank terms.
ML Jun 29, 2020
Robust Linear Regression: Optimal Rates in Polynomial Time
Ainesh Bakshi, Adarsh Prasad
We obtain robust and computationally efficient estimators for learning several linear models that achieve statistically optimal convergence rate under minimal distributional assumptions. Concretely, we assume our data is drawn from a $k$-hypercontractive distribution and an $ε$-fraction is adversarially corrupted. We then describe an estimator that converges to the optimal least-squares minimizer for the true distribution at a rate proportional to $ε^{2-2/k}$, when the noise is independent of the covariates. We note that no such estimator was known prior to our work, even with access to unbounded computation. The rate we achieve is information-theoretically optimal and thus we resolve the main open question in Klivans, Kothari and Meka [COLT'18]. Our key insight is to identify an analytic condition that serves as a polynomial relaxation of independence of random variables. In particular, we show that when the moments of the noise and covariates are negatively-correlated, we obtain the same rate as independent noise. Further, when the condition is not satisfied, we obtain a rate proportional to $ε^{2-4/k}$, and again match the information-theoretic lower bound. Our central technical contribution is to algorithmically exploit independence of random variables in the "sum-of-squares" framework by formulating it as the aforementioned polynomial inequality.
DS May 6, 2020
Outlier-Robust Clustering of Non-Spherical Mixtures
Ainesh Bakshi, Pravesh Kothari
We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated $d$-dimensional Gaussians ($k$-GMMs). Concretely, our algorithm takes as input an $ε$-corrupted sample from a $k$-GMM and, with high probability, in $d^{\text{poly}(k/η)}$ time outputs an approximate clustering that misclassifies at most a $k^{O(k)}(ε+η)$ fraction of the points whenever every pair of mixture components is separated by $1-\exp(-\text{poly}(k/η)^k)$ in total variation (TV) distance. Such a result was not previously known even for $k=2$. TV separation is the statistically weakest possible notion of separation and captures important special cases such as mixed linear regression and subspace clustering. Our main conceptual contribution is to distill simple analytic properties - (certifiable) hypercontractivity and bounded variance of degree 2 polynomials and anti-concentration of linear projections - that are necessary and sufficient for mixture models to be (efficiently) clusterable. As a consequence, our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere. Even the information-theoretic clusterability of separated distributions satisfying these two analytic assumptions was not known prior to our work and is likely to be of independent interest. Our algorithms build on the recent sequence of works relying on certifiable anti-concentration, first introduced in the works of Karmalkar, Klivans, and Kothari, and of Raghavendra and Yau, in 2019. Our techniques expand the sum-of-squares toolkit to show robust certifiability of TV-separated Gaussian clusters in data. This involves giving a low-degree sum-of-squares proof of statements that relate parameter (i.e. mean and covariances) distance to total variation distance by relying only on hypercontractivity and anti-concentration.
DS Feb 12, 2020
List-Decodable Subspace Recovery: Dimension Independent Error in Polynomial Time
Ainesh Bakshi, Pravesh K. Kothari
In list-decodable subspace recovery, the input is a collection of $n$ points, $αn$ of which (for some $α\ll 1/2$) are drawn i.i.d. from a distribution $\mathcal{D}$ with an isotropic rank-$r$ covariance $Π_*$ (the \emph{inliers}), and the rest are arbitrary, potentially adversarial outliers. The goal is to recover an $O(1/α)$ size list of candidate covariances that contains a $\hat{Π}$ close to $Π_*$. Two recent independent works (Raghavendra-Yau, Bakshi-Kothari 2020) gave the first efficient algorithm for this problem. These results, however, obtain an error that grows with the dimension (linearly in [RY] and logarithmically in [BK], at the cost of quasi-polynomial running time) and rely on \emph{certifiable anti-concentration} - a relatively strict condition satisfied essentially only by the Gaussian distribution. In this work, we improve on these results on all three fronts: \emph{dimension-independent} error via a faster fixed-polynomial running time under less restrictive distributional assumptions. Specifically, we give a $\text{poly}(1/α) d^{O(1)}$ time algorithm that outputs a list containing a $\hat{Π}$ satisfying $\|\hat{Π} -Π_*\|_F \leq O(1/α)$. Our result only needs $\mathcal{D}$ to have \emph{certifiably hypercontractive} degree 2 polynomials. As a result, in addition to Gaussians, our algorithm applies to the uniform distribution on the hypercube and $q$-ary cubes and arbitrary product distributions with subgaussian marginals. Prior work (Raghavendra and Yau, 2020) had identified such distributions as potential hard examples, as such distributions do not exhibit strong enough anti-concentration. When $\mathcal{D}$ satisfies certifiable anti-concentration, we obtain a stronger error guarantee of $\|\hat{Π}-Π_*\|_F \leq η$ for any arbitrary $η> 0$ in $d^{O(\text{poly}(1/α) + \log (1/η))}$ time.
DS Dec 9, 2019
Robust and Sample Optimal Algorithms for PSD Low-Rank Approximation
Ainesh Bakshi, Nadiia Chepurko, David P. Woodruff
Recently, Musco and Woodruff (FOCS, 2017) showed that given an $n \times n$ positive semidefinite (PSD) matrix $A$, it is possible to compute a $(1+ε)$-approximate relative-error low-rank approximation to $A$ by querying $O(nk/ε^{2.5})$ entries of $A$ in time $O(nk/ε^{2.5} +n k^{ω-1}/ε^{2(ω-1)})$. They also showed that any relative-error low-rank approximation algorithm must query $Ω(nk/ε)$ entries of $A$; this gap has since remained open. Our main result is to resolve this question by obtaining an optimal algorithm that queries $O(nk/ε)$ entries of $A$ and outputs a relative-error low-rank approximation in $O(n(k/ε)^{ω-1})$ time. Note that our running time improves that of Musco and Woodruff, and matches the information-theoretic lower bound if the matrix-multiplication exponent $ω$ is $2$. We then extend our techniques to negative-type distance matrices. Bakshi and Woodruff (NeurIPS, 2018) showed a bi-criteria, relative-error low-rank approximation which queries $O(nk/ε^{2.5})$ entries and outputs a rank-$(k+4)$ matrix. We show that the bi-criteria guarantee is not necessary and obtain an $O(nk/ε)$ query algorithm, which is optimal. Our algorithm applies to all distance matrices that arise from metrics satisfying negative-type inequalities, including $\ell_1, \ell_2,$ spherical metrics and hypermetrics. Next, we introduce a new robust low-rank approximation model which captures PSD matrices that have been corrupted with noise. While a sample complexity lower bound precludes sublinear algorithms for arbitrary PSD matrices, we provide the first sublinear time and query algorithms when the corruption on the diagonal entries is bounded. As a special case, we show sample-optimal sublinear time algorithms for low-rank approximation of correlation matrices corrupted by noise.
DS Nov 5, 2018
Learning Two Layer Rectified Neural Networks in Polynomial Time
Ainesh Bakshi, Rajesh Jayaram, David P. Woodruff
Consider the following fundamental learning problem: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. We consider two-layer networks, mapping $\mathbb{R}^d$ to $\mathbb{R}^m$, with $k$ non-linear activation units $f(\cdot)$, where $f(x) = \max \{x , 0\}$ is the ReLU. Such a network is specified by two weight matrices, $\mathbf{U}^* \in \mathbb{R}^{m \times k}, \mathbf{V}^* \in \mathbb{R}^{k \times d}$, such that the label of an example $x \in \mathbb{R}^{d}$ is given by $\mathbf{U}^* f(\mathbf{V}^* x)$, where $f(\cdot)$ is applied coordinate-wise. Given $n$ samples as a matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$ and the (possibly noisy) labels $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X}) + \mathbf{E}$ of the network on these samples, where $\mathbf{E}$ is a noise matrix, our goal is to recover the weight matrices $\mathbf{U}^*$ and $\mathbf{V}^*$. In this work, we develop algorithms and hardness results under varying assumptions on the input and noise. Although the problem is NP-hard even for $k=2$, by assuming Gaussian marginals over the input $\mathbf{X}$ we are able to develop polynomial time algorithms for the approximate recovery of $\mathbf{U}^*$ and $\mathbf{V}^*$. Perhaps surprisingly, in the noiseless case our algorithms recover $\mathbf{U}^*,\mathbf{V}^*$ exactly, i.e., with no error. To the best of our knowledge, this is the first algorithm to accomplish exact recovery. For the noisy case, we give the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise $\mathbf{E}$. Our algorithms generalize to a larger class of rectified activation functions, $f(x) = 0$ when $x\leq 0$, and $f(x) > 0$ otherwise.
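The generative model in this abstract can be illustrated with a minimal sketch, which only evaluates the noiseless label $\mathbf{U}^* f(\mathbf{V}^* x)$ for one example and is not the recovery algorithm of the paper. The function names and the toy weights below are assumptions chosen for illustration.

```python
def relu(t):
    # f(t) = max{t, 0}, the ReLU activation from the abstract.
    return max(t, 0.0)


def matvec(M, x):
    # Plain matrix-vector product for a matrix given as a list of rows.
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def two_layer_label(U, V, x):
    # Noiseless label U f(V x), with f applied coordinate-wise.
    hidden = [relu(h) for h in matvec(V, x)]
    return matvec(U, hidden)


# Hypothetical toy instance with d = 2 inputs, k = 2 hidden units, m = 1 output.
V = [[1.0, -1.0],
     [0.5, 2.0]]
U = [[2.0, 1.0]]
x = [1.0, 3.0]

# V x = [-2.0, 6.5]; the ReLU zeroes the first coordinate, so the label is [6.5].
label = two_layer_label(U, V, x)
```

The learning problem runs in the opposite direction: given many pairs of `x` and `label`, recover `U` and `V`.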
DSSep 19, 2018
Sublinear Time Low-Rank Approximation of Distance MatricesAinesh Bakshi, David P. Woodruff
Let $\mathbf{P}=\{ p_1, p_2, \ldots p_n \}$ and $\mathbf{Q} = \{ q_1, q_2 \ldots q_m \}$ be two point sets in an arbitrary metric space. Let $\mathbf{A}$ represent the $m\times n$ pairwise distance matrix with $\mathbf{A}_{i,j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low-rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $\mathbf{A}$, and consequently our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $\mathbf{P} = \mathbf{Q}$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time.
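As a small illustration of the object studied in this abstract, the sketch below builds a pairwise distance matrix explicitly, which reads every entry and is therefore the opposite of a sublinear-time method. It also checks on three collinear points that the squared Euclidean distance violates the triangle inequality. The helper names and sample points are assumptions. Rows are indexed by points of $\mathbf{P}$ and columns by points of $\mathbf{Q}$, following $\mathbf{A}_{i,j} = d(p_i, q_j)$.

```python
import math


def distance_matrix(P, Q, d):
    # Entry (i, j) is d(P[i], Q[j]); every pair is evaluated explicitly.
    return [[d(p, q) for q in Q] for p in P]


def euclidean(p, q):
    return math.dist(p, q)


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical sample points in the plane.
P = [(0.0, 0.0), (3.0, 4.0)]
Q = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
A = distance_matrix(P, Q, euclidean)

# Three collinear points: the squared distance from x0 to x2 exceeds the sum
# of the two intermediate squared distances, so the triangle inequality fails.
x0, x1, x2 = (0.0,), (1.0,), (2.0,)
violates_triangle = squared_euclidean(x0, x2) > (
    squared_euclidean(x0, x1) + squared_euclidean(x1, x2)
)
```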
DSMar 2, 2017
Robust Communication-Optimal Distributed Clustering AlgorithmsPranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan et al.
In this work, we study the $k$-median and $k$-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a $k$-median or $k$-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal $k$-median or $k$-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Balcan et al. 2013], even when a constant fraction of the data are outliers. The communication complexity is $\tilde O(sk+z)$ where $s$ is the number of machines, $k$ is the number of clusters, and $z$ is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching $Ω(sk+z)$ lower bound on the communication required both for approximating the optimal $k$-means or $k$-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, $Ω(sk+z)$ is a communication bottleneck, even for real-world instances.
LGDec 26, 2014
A Novel Feature Selection and Extraction Technique for ClassificationKratarth Goel, Raunaq Vohra, Ainesh Bakshi
This paper presents a versatile technique for feature selection and extraction: Class Dependent Features (CDFs). We use CDFs to improve the accuracy of classification and at the same time control computational expense by tackling the curse of dimensionality. In order to demonstrate the generality of this technique, it is applied to handwritten digit recognition and text categorization.