Yeshwanth Cherapanamjeri

LG
h-index16
20papers
447citations
Novelty61%
AI Score45

20 Papers

LGApr 18, 2023
Optimal PAC Bounds Without Uniform Convergence

Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty et al. · eth-zurich

In statistical learning theory, determining the sample complexity of realizable binary classification for VC classes was a long-standing open problem. The results of Simon and Hanneke established sharp upper bounds in this setting. However, the reliance of their argument on the uniform convergence principle limits its applicability to more general learning settings such as multiclass classification. In this paper, we address this issue by providing optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments. Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds. As an application, by adapting the one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth, we propose an algorithm that achieves an optimal PAC bound for binary classification. Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal, addressing a variant of a classic question posed by Warmuth. We further instantiate our framework in three settings where uniform convergence is provably suboptimal. For multiclass classification, we prove an optimal risk bound that scales with the one-inclusion hypergraph density of the class, addressing the suboptimality of the analysis of Daniely and Shalev-Shwartz. For partial hypothesis classification, we determine the optimal sample complexity bound, resolving a question posed by Alon, Hanneke, Holzman, and Moran. For realizable bounded regression with absolute loss, we derive an optimal risk bound that relies on a modified version of the scale-sensitive dimension, refining the results of Bartlett and Long. Our rates surpass standard uniform convergence-based results due to the smaller complexity measure in our risk bound.

STMay 6, 2022
What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas et al.

In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.

LGDec 19, 2022
The One-Inclusion Graph Algorithm is not Always Optimal

Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty et al. · eth-zurich

The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov's inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.

LGMar 3, 2022
Uniform Approximations for Randomized Hadamard Transforms with Applications

Yeshwanth Cherapanamjeri, Jelani Nelson

Randomized Hadamard Transforms (RHTs) have emerged as a computationally efficient alternative to the use of dense unstructured random matrices across a range of domains in computer science and machine learning. For several applications such as dimensionality reduction and compressed sensing, the theoretical guarantees for methods based on RHTs are comparable to approaches using dense random matrices with i.i.d.\ entries. However, several such applications are in the low-dimensional regime where the number of rows sampled from the matrix is rather small. Prior arguments are not applicable to the high-dimensional regime often found in machine learning applications like kernel approximation. Given an ensemble of RHTs with Gaussian diagonals, $\{M^i\}_{i = 1}^m$, and any $1$-Lipschitz function, $f: \mathbb{R} \to \mathbb{R}$, we prove that the average of $f$ over the entries of $\{M^i v\}_{i = 1}^m$ converges to its expectation uniformly over $\| v \| \leq 1$ at a rate comparable to that obtained from using truly Gaussian matrices. We use our inequality to then derive improved guarantees for two applications in the high-dimensional regime: 1) kernel approximation and 2) distance estimation. For kernel approximation, we prove the first \emph{uniform} approximation guarantees for random features constructed through RHTs lending theoretical justification to their empirical success while for distance estimation, our convergence result implies data structures with improved runtime guarantees over previous work by the authors. We believe our general inequality is likely to find use in other applications.

AIJul 1, 2024
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.

Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or its approximations to identify subgraphs in models for specific tasks (circuits). They often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models. CD-T can produce circuits of arbitrary level of abstraction, and is the first able to produce circuits as fine-grained as attention heads at specific sequence positions efficiently. CD-T consists of a set of mathematical equations to isolate contribution of model features. Through recursively computing contribution of all nodes in a computational graph of a model using CD-T followed by pruning, we are able to reduce circuit discovery runtime from hours to seconds compared to state-of-the-art baselines. On three standard circuit evaluation datasets (indirect object identification, greater-than comparisons, and docstring completion), we demonstrate that CD-T outperforms ACDC and EAP by better recovering the manual circuits with an average of 97% ROC AUC under low runtimes. In addition, we provide evidence that faithfulness of CD-T circuits is not due to random chance by showing our circuits are 80% more faithful than random circuits of up to 60% of the original model size. Finally, we show CD-T circuits are able to perfectly replicate original models' behavior (faithfulness $ = 1$) using fewer nodes than the baselines for all tasks. Our results underscore the great promise of CD-T for efficient automated mechanistic interpretability, paving the way for new insights into the workings of large language models.

STOct 16, 2023
Statistical Barriers to Affine-equivariant Estimation

Zihao Chen, Yeshwanth Cherapanamjeri

We investigate the quantitative performance of affine-equivariant estimators for robust mean estimation. As a natural stability requirement, the construction of such affine-equivariant estimators has been extensively studied in the statistics literature. We quantitatively evaluate these estimators under two outlier models which have been the subject of much recent work: the heavy-tailed and adversarial corruption settings. We establish lower bounds which show that affine-equivariance induces a strict degradation in recovery error with quantitative rates degrading by a factor of $\sqrt{d}$ in both settings. We find that classical estimators such as the Tukey median (Tukey '75) and Stahel-Donoho estimator (Stahel '81 and Donoho '82) are either quantitatively sub-optimal even within the class of affine-equivariant estimators or lack any quantitative guarantees. On the other hand, recent estimators with strong quantitative guarantees are not affine-equivariant or require additional distributional assumptions to achieve it. We remedy this by constructing a new affine-equivariant estimator which nearly matches our lower bound. Our estimator is based on a novel notion of a high-dimensional median which may be of independent interest. Notably, our results are applicable more broadly to any estimator whose performance is evaluated in the Mahalanobis norm which, for affine-equivariant estimators, corresponds to an evaluation in Euclidean norm on isotropic distributions.

LGNov 5, 2024
How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis

The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30,000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g.~$10\%$ of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.

LGJun 16, 2025
The Space Complexity of Learning-Unlearning Algorithms

Yeshwanth Cherapanamjeri, Sumegha Garg, Nived Rajaraman et al.

We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to delete certain training samples at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class \(\mathcal{H}\). Toward that end, we first provide a negative result showing that the VC dimension is not a characterization of the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store \(Ω(n)\)-bits, where \(n\) denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class \(\mathcal{H}\), the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the \textit{eluder dimension} of \(\mathcal{H}\), a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound in terms of the star number of the underlying hypothesis class, albeit in a stronger ticketed-memory model proposed by Ghazi et al. (2023). Since the star number for a hypothesis class is never larger than its Eluder dimension, our work highlights a fundamental separation between central and ticketed memory models for machine unlearning.

LGOct 17, 2025
Learning Correlated Reward Models: Statistical Barriers and Opportunities

Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina et al.

Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses \emph{all} human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a \emph{correlated} probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is \emph{fundamentally insufficient} to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that \emph{best-of-three} preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.

STNov 22, 2024
Heavy-tailed Contamination is Easier than Adversarial Contamination

Yeshwanth Cherapanamjeri, Daniel Lee

A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high-probability.

CLMay 27, 2023
Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making

Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Briton Park et al.

Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains.

DSOct 17, 2021
Terminal Embeddings in Sublinear Time

Yeshwanth Cherapanamjeri, Jelani Nelson

Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $ρ\ge 1$ if $ρ$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C ρd_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+ε$ is achievable via such a terminal embedding with $m = O(ε^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating their embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $Θ(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-Θ(ε^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.

LGJun 23, 2021
Adversarial Examples in Multi-Layer Random ReLU Networks

Peter L. Bartlett, Sébastien Bubeck, Yeshwanth Cherapanamjeri

We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.

LGApr 8, 2021
A single gradient step finds adversarial examples on random two-layers neural networks

Sébastien Bubeck, Yeshwanth Cherapanamjeri, Gauthier Gidel et al.

Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function.

STNov 24, 2020
Optimal Mean Estimation without a Variance

Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett et al.

We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample $\mathbf{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $μ$ which satisfies the following \emph{weak-moment} assumption for some ${α\in [0, 1]}$: \begin{equation*} \forall \|v\| = 1: \mathbb{E}_{X \thicksim \mathcal{D}}[\lvert \langle X - μ, v\rangle \rvert^{1 + α}] \leq 1, \end{equation*} and given a target failure probability, $δ$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n,d,δ$. For the specific case of $α= 1$, foundational work of Lugosi and Mendelson exhibits an estimator achieving subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general $α$, and establish the following information-theoretic lower bound on the optimal attainable confidence interval: \begin{equation*} Ω\left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{\fracα{(1 + α)}} + \left(\frac{\log 1 / δ}{n}\right)^{\fracα{(1 + α)}}\right). \end{equation*} Moreover, we devise a computationally-efficient estimator which achieves this lower bound.

MLJul 16, 2020
Optimal Robust Linear Regression in Nearly Linear Time

Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni et al.

We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X,w^* \rangle + ε$ (with $X \in \mathbb{R}^d$ and $ε$ independent), in which an $η$ fraction of the samples have been adversarially corrupted. We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E} [XX^\top]$ has bounded condition number and $ε$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $ε$ is sub-Gaussian. In both settings, our estimators: (a) Achieve optimal sample complexities and recovery guarantees up to log factors and (b) Run in near linear time ($\tilde{O}(nd / η^6)$). Prior to our work, polynomial time algorithms achieving near optimal sample complexities were only known in the setting where $X$ is Gaussian with identity covariance and $ε$ is Gaussian, and no linear time estimators were known for robust linear regression in any setting. Our estimators and their analysis leverage recent developments in the construction of faster algorithms for robust mean estimation to improve runtimes, and refined concentration of measure arguments alongside Gaussian rounding techniques to improve statistical sample complexities.

STFeb 6, 2019
Testing Markov Chains without Hitting

Yeshwanth Cherapanamjeri, Peter L. Bartlett

We study the problem of identity testing of markov chains. In this setting, we are given access to a single trajectory from a markov chain with unknown transition matrix $Q$ and the goal is to determine whether $Q = P$ for some known matrix $P$ or $\text{Dist}(P, Q) \geq ε$ where $\text{Dist}$ is suitably defined. In recent work by Daskalakis, Dikkala and Gravin, 2018, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of $P$ which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time thus enabling efficient testing of markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of markov chains.

STFeb 6, 2019
Fast Mean Estimation with Sub-Gaussian Rates

Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett

We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^4+n^2d)$ for $n$ i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.

LGFeb 18, 2017
Thresholding based Efficient Outlier Robust PCA

Yeshwanth Cherapanamjeri, Prateek Jain, Praneeth Netrapalli

We consider the problem of outlier robust PCA (OR-PCA) where the goal is to recover principal directions despite the presence of outlier data points. That is, given a data matrix $M^*$, where $(1-α)$ fraction of the points are noisy samples from a low-dimensional subspace while $α$ fraction of the points can be arbitrary outliers, the goal is to recover the subspace accurately. Existing results for \OR-PCA have serious drawbacks: while some results are quite weak in the presence of noise, other results have runtime quadratic in dimension, rendering them impractical for large scale applications. In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. Moreover, the fraction of outliers, $α$, that our method can handle is tight up to constants while providing nearly optimal computational complexity for a general noise setting. For the special case where the inliers are obtained from a low-dimensional subspace with additive Gaussian noise, we show that a modification of our thresholding based method leads to significant improvement in recovery error (of the subspace) even in the presence of a large fraction of outliers.

LGJun 23, 2016
Nearly-optimal Robust Matrix Completion

Yeshwanth Cherapanamjeri, Kartik Gupta, Prateek Jain

In this paper, we consider the problem of Robust Matrix Completion (RMC) where the goal is to recover a low-rank matrix by observing a small number of its entries out of which a few can be arbitrarily corrupted. We propose a simple projected gradient descent method to estimate the low-rank matrix that alternately performs a projected gradient descent step and cleans up a few of the corrupted entries using hard-thresholding. Our algorithm solves RMC using nearly optimal number of observations as well as nearly optimal number of corruptions. Our result also implies significant improvement over the existing time complexity bounds for the low-rank matrix completion problem. Finally, an application of our result to the robust PCA problem (low-rank+sparse matrix separation) leads to nearly linear time (in matrix dimensions) algorithm for the same; existing state-of-the-art methods require quadratic time. Our empirical results corroborate our theoretical results and show that even for moderate sized problems, our method for robust PCA is an an order of magnitude faster than the existing methods.