Larry Wasserman

ML
h-index16
55papers
5,169citations
Novelty48%
AI Score53

55 Papers

MESep 13, 2023
Simultaneous inference for generalized linear models with unmeasured confounders

Jin-Hong Du, Larry Wasserman, Kathryn Roeder

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.

STNov 5, 2025
Statistical Properties of Rectified Flow

Gonzalo Mena, Arun Kumar Kuchibhotla, Larry Wasserman

Rectified flow (Liu et al., 2022; Liu, 2022; Wu et al., 2023) is a method for defining a transport map between two distributions, and enjoys popularity in machine learning, although theoretical results supporting the validity of these methods are scant. The rectified flow can be regarded as an approximation to optimal transport, but in contrast to other transport methods that require optimization over a function space, computing the rectified flow only requires standard statistical tools such as regression or density estimation. Because of this, one can leverage standard data analysis tools for regression and density estimation to develop empirical versions of transport maps. We study some structural properties of the rectified flow, including existence, uniqueness, and regularity, as well as the related statistical properties, such as rates of convergence and central limit theorems, for some selected estimators. To do so, we analyze separately the bounded and unbounded cases as each presents unique challenges. In both cases, we are able to establish convergence at faster rates than the ones for the usual nonparametric regression and density estimation.

MLMar 24
Wasserstein Parallel Transport for Predicting the Dynamics of Statistical Systems

Tristan Luca Saidi, Gonzalo Mena, Larry Wasserman et al.

Many scientific systems, such as cellular populations or economic cohorts, are naturally described by probability distributions that evolve over time. Predicting how such a system would have evolved under different forces or initial conditions is fundamental to causal inference, domain adaptation, and counterfactual prediction. However, the space of distributions often lacks the vector space structure on which classical methods rely. To address this, we introduce a general notion of parallel dynamics at a distributional level. We base this principle on parallel transport of tangent dynamics along optimal transport geodesics and call it ``Wasserstein Parallel Trends''. By replacing the vector subtraction of classic methods with geodesic parallel transport, we can provide counterfactual comparisons of distributional dynamics in applications such as causal inference, domain adaptation, and batch-effect correction in experimental settings. The main mathematical contribution is a novel notion of fanning scheme on the Wasserstein manifold that allows us to efficiently approximate parallel transport along geodesics while also providing the first theoretical guarantees for parallel transport in the Wasserstein space. We also show that Wasserstein Parallel Trends recovers the classic parallel trends assumption for averages as a special case and derive closed-form parallel transport for Gaussian measures. We deploy the method on synthetic data and two single-cell RNA sequencing datasets to impute gene-expression dynamics across biological systems.

STJun 23, 2025
Statistical Inference for Optimal Transport Maps: Recent Advances and Perspectives

Sivaraman Balakrishnan, Tudor Manole, Larry Wasserman

In many applications of optimal transport (OT), the object of primary interest is the optimal transport map. This map rearranges mass from one probability distribution to another in the most efficient way possible by minimizing a specified cost. In this paper we review recent advances in estimating and developing limit theorems for the OT map, using samples from the underlying distributions. We also review parallel lines of work that establish similar results for special cases and variants of the basic OT setup. We conclude with a discussion of key directions for future research with the goal of providing practitioners with reliable inferential tools.

MLJun 30, 2025
Disentangled Feature Importance

Jin-Hong Du, Kathryn Roeder, Larry Wasserman

Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias. To address this limitation, we introduce \emph{Disentangled Feature Importance (DFI)}, a nonparametric generalization of the classical $R^2$ decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. Importance is computed in this disentangled space and attributed back through the transport map's sensitivity. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models and to interaction-weighted functional ANOVA variances more generally, under arbitrary feature dependencies. We develop a comprehensive semiparametric theory for DFI. For general transport maps, we establish root-$n$ consistency and asymptotic normality of importance estimators in the latent space, which extends to the original feature space for the Bures-Wasserstein map. Notably, our estimators achieve second-order estimation error, which vanishes if both regression function and transport map estimation errors are $o_{\mathbb{P}}(n^{-1/4})$. By design, DFI avoids the computational burden of repeated submodel refitting and the challenges of conditional covariate distribution estimation, thereby achieving computational efficiency.

MEDec 21, 2021
Data fission: splitting a single data point

James Leiner, Boyan Duan, Larry Wasserman et al.

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative approach that uses additive Gaussian noise -- this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.

MENov 21, 2021
Decorrelated Variable Importance

Isabella Verdinelli, Larry Wasserman

Because of the widespread use of black box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter - known as LOCO (Leave Out COvariates) - based on dropping covariates from a regression model. This is essentially a nonparametric version of R-squared. This parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.

MENov 17, 2021
Universal Inference Meets Random Projections: A Scalable Test for Log-concavity

Robin Dunn, Aditya Gangrade, Larry Wasserman et al.

Shape constraints yield flexible middle grounds between fully nonparametric and fully parametric approaches to modeling distributions of data. The specific assumption of log-concavity is motivated by applications across economics, survival modeling, and reliability theory. However, there do not currently exist valid tests for whether the underlying density of given data is log-concave. The recent universal inference methodology provides a valid test. The universal test relies on maximum likelihood estimation (MLE), and efficient methods already exist for finding the log-concave MLE. This yields the first test of log-concavity that is provably valid in finite samples in any dimension, for which we also establish asymptotic consistency results. Empirically, we find that a random projections approach that converts the d-dimensional testing problem into many one-dimensional problems can yield high power, leading to a simple procedure that is statistically and computationally efficient.

STJul 26, 2021
Plugin Estimation of Smooth Optimal Transport Maps

Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed et al.

We analyze a number of natural estimators for the optimal transport map between two distributions and show that they are minimax optimal. We adopt the plugin approach: our estimators are simply optimal couplings between measures derived from our observations, appropriately extended so that they define functions on $\mathbb{R}^d$. When the underlying map is assumed to be Lipschitz, we show that computing the optimal coupling between the empirical measures, and extending it using linear smoothers, already gives a minimax optimal estimator. When the underlying map enjoys higher regularity, we show that the optimal coupling between appropriate nonparametric density estimates yields faster rates. Our work also provides new bounds on the risk of corresponding plugin estimators for the quadratic Wasserstein distance, and we show how this problem relates to that of estimating optimal transport maps using stability arguments for smooth and strongly convex Brenier potentials. As an application of our results, we derive central limit theorems for plugin estimators of the squared Wasserstein distance, which are centered at their population counterpart when the underlying distributions have sufficiently smooth densities. In contrast to known central limit theorems for empirical estimators, this result easily lends itself to statistical inference for the quadratic Wasserstein distance.

MLMar 8, 2021
Forest Guided Smoothing

Isabella Verdinelli, Larry Wasserman

We use the output of a random forest to define a family of local smoothers with spatially adaptive bandwidth matrices. The smoother inherits the flexibility of the original forest but, since it is a simple, linear smoother, it is very interpretable and it can be used for tasks that would be intractable for the original forest. This includes bias correction, confidence intervals, assessing variable importance and methods for exploring the structure of the forest. We illustrate the method on some synthetic examples and on data related to Covid-19.

MLJun 26, 2020
The huge Package for High-dimensional Undirected Graph Estimation in R

Tuo Zhao, Han Liu, Kathryn Roeder et al.

We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortan, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency.

LGFeb 7, 2020
PLLay: Efficient Topological Layer based on Persistence Landscapes

Kwangho Kim, Jisu Kim, Manzil Zaheer et al.

We propose PLLay, a novel topological layer for general deep learning models based on persistence landscapes, in which we can efficiently exploit the underlying topological features of the input data structure. In this work, we show differentiability with respect to layer inputs, for a general persistent homology with arbitrary filtration. Thus, our proposed layer can be placed anywhere in the network and feed critical information on the topological features of input data into subsequent layers to improve the learnability of the networks toward a given task. A task-optimal structure of PLLay is learned during training via backpropagation, without requiring any input featurization or data preprocessing. We provide a novel adaptation for the DTM function-based filtration, and show that the proposed layer is robust against noise and outliers through a stability analysis. We demonstrate the effectiveness of our approach by classification experiments on various datasets.

STDec 24, 2019
Universal Inference

Larry Wasserman, Aaditya Ramdas, Sivaraman Balakrishnan

We propose a general method for constructing hypothesis tests and confidence sets that have finite sample guarantees without regularity conditions. We refer to such procedures as "universal." The method is very simple and is based on a modified version of the usual likelihood ratio statistic, that we call "the split likelihood ratio test" (split LRT). The method is especially appealing for irregular statistical models. Canonical examples include mixture models and models that arise in shape-constrained inference. Constructing tests and confidence sets for such models is notoriously difficult. Typical inference methods, like the likelihood ratio test, are not useful in these cases because they have intractable limiting distributions. In contrast, the method we suggest works for any parametric model and also for some nonparametric models. The split LRT can also be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid $p$-values and confidence sequences.

MEOct 7, 2019
Gaussian Mixture Clustering Using Relative Tests of Fit

Purvasha Chakravarti, Sivaraman Balakrishnan, Larry Wasserman

We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as in a sequential manner for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.

STSep 17, 2019
Minimax Confidence Intervals for the Sliced Wasserstein Distance

Tudor Manole, Sivaraman Balakrishnan, Larry Wasserman

Motivated by the growing popularity of variants of the Wasserstein distance in statistics and machine learning, we study statistical inference for the Sliced Wasserstein distance--an easily computable variant of the Wasserstein distance. Specifically, we construct confidence intervals for the Sliced Wasserstein distance which have finite-sample validity under no assumptions or under mild moment assumptions. These intervals are adaptive in length to the regularity of the underlying distributions. We also bound the minimax risk of estimating the Sliced Wasserstein distance, and as a consequence establish that the lengths of our proposed confidence intervals are minimax optimal over appropriate distribution classes. To motivate the choice of these classes, we also study minimax rates of estimating a distribution under the Sliced Wasserstein distance. These theoretical findings are complemented with a simulation study demonstrating the deficiencies of the classical bootstrap, and the advantages of our proposed methods. We also show strong correspondences between our theoretical predictions and the adaptivity of our confidence interval lengths in simulations. We conclude by demonstrating the use of our confidence intervals in the setting of simulator-based likelihood-free inference. In this setting, contrasting popular approximate Bayesian computation methods, we develop uncertainty quantification methods with rigorous frequentist coverage guarantees.

MLMay 24, 2018
Cautious Deep Learning

Yotam Hechtlinger, Barnabás Póczos, Larry Wasserman

Most classifiers operate by selecting the maximum of an estimate of the conditional distribution $p(y|x)$ where $x$ stands for the features of the instance to be classified and $y$ denotes its label. This often results in a {\em hubristic bias}: overconfidence in the assignment of a definite label. Usually, the observations are concentrated on a small volume but the classifier provides definite predictions for the entire space. We propose constructing conformal prediction sets which contain a set of labels rather than a single label. These conformal prediction sets contain the true label with probability $1-α$. Our construction is based on $p(x|y)$ rather than $p(y|x)$ which results in a classifier that is very cautious: it outputs the null set --- meaning "I don't know" --- when the object does not resemble the training examples. An important property of our approach is that adversarial attacks are likely to be predicted as the null set or would also include the true label. We demonstrate the performance on the ImageNet ILSVRC dataset and the CelebA and IMDB-Wiki facial datasets using high dimensional features obtained from state of the art convolutional neural networks.

MLDec 17, 2017
Hypothesis Testing for High-Dimensional Multinomials: A Selective Review

Sivaraman Balakrishnan, Larry Wasserman

The statistical analysis of discrete data has been the subject of extensive statistical research dating back to the work of Pearson. In this survey we review some recently developed methods for testing hypotheses about high-dimensional multinomials. Traditional tests like the $χ^2$ test and the likelihood ratio test can have poor power in the high-dimensional setting. Much of the research in this area has focused on finding tests with asymptotically Normal limits and developing (stringent) conditions under which tests have Normal limits. We argue that this perspective suffers from a significant deficiency: it can exclude many high-dimensional cases when - despite having non Normal null distributions - carefully designed tests can have high power. Finally, we illustrate that taking a minimax perspective and considering refinements of this perspective can lead naturally to powerful and practical tests.

STJun 30, 2017
Hypothesis Testing For Densities and High-Dimensional Multinomials: Sharp Local Minimax Rates

Sivaraman Balakrishnan, Larry Wasserman

We consider the goodness-of-fit testing problem of distinguishing whether the data are drawn from a specified distribution, versus a composite alternative separated from the null in the total variation metric. In the discrete case, we consider goodness-of-fit testing when the null distribution has a possibly growing or unbounded number of categories. In the continuous case, we consider testing a Lipschitz density, with possibly unbounded support, in the low-smoothness regime where the Lipschitz parameter is not assumed to be constant. In contrast to existing results, we show that the minimax rate and critical testing radius in these settings depend strongly, and in a precise way, on the null distribution being tested and this motivates the study of the (local) minimax rate as a function of the null distribution. For multinomials the local minimax rate was recently studied in the work of Valiant and Valiant. We re-visit and extend their results and develop two modifications to the chi-squared test whose performance we characterize. For testing Lipschitz densities, we show that the usual binning tests are inadequate in the low-smoothness regime and we design a spatially adaptive partitioning scheme that forms the basis for our locally minimax optimal tests. Furthermore, we provide the first local minimax lower bounds for this problem which yield a sharp characterization of the dependence of the critical radius on the null hypothesis being tested. In the low-smoothness regime we also provide adaptive tests, that adapt to the unknown smoothness parameter. We illustrate our results with a variety of simulations that demonstrate the practical utility of our proposed tests.

MESep 2, 2016
Least Ambiguous Set-Valued Classifiers with Bounded Error Levels

Mauricio Sadinle, Jing Lei, Larry Wasserman

In most classification tasks there are observations that are ambiguous and therefore difficult to correctly label. Set-valued classifiers output sets of plausible labels rather than a single label, thereby giving a more appropriate and informative treatment to the labeling of ambiguous instances. We introduce a framework for multiclass set-valued classification, where the classifiers guarantee user-defined levels of coverage or confidence (the probability that the true label is contained in the set) while minimizing the ambiguity (the expected size of the output). We first derive oracle classifiers assuming the true distribution to be known. We show that the oracle classifiers are obtained from level sets of the functions that define the conditional probability of each class. Then we develop estimators with good asymptotic and finite sample properties. The proposed estimators build on existing single-label classifiers. The optimal classifier can sometimes output the empty set, but we provide two solutions to fix this issue that are suitable for various practical needs.

MEJun 1, 2016
Finding Singular Features

Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli et al.

We present a method for finding high density, low-dimensional structures in noisy point clouds. These structures are sets with zero Lebesgue measure with respect to the $D$-dimensional ambient space and belong to a $d<D$ dimensional space. We call them "singular features." Hunting for singular features corresponds to finding unexpected or unknown structures hidden in point clouds belonging to $\R^D$. Our method outputs well defined sets of dimensions $d<D$. Unlike spectral clustering, the method works well in the presence of noise. We show how to find singular features by first finding ridges in the estimated density, followed by a filtering step based on the eigenvalues of the Hessian of the density.

STMay 20, 2016
Statistical Inference for Cluster Trees

Jisu Kim, Yen-Chi Chen, Sivaraman Balakrishnan et al.

A cluster tree provides a highly-interpretable summary of a density function by representing the hierarchy of its high-density clusters. It is estimated using the empirical tree, which is the cluster tree constructed from a density estimator. This paper addresses the basic question of quantifying our uncertainty by assessing the statistical significance of topological features of an empirical cluster tree. We first study a variety of metrics that can be used to compare different trees, analyze their properties and assess their suitability for inference. We then propose methods to construct and summarize confidence sets for the unknown true cluster tree. We introduce a partial ordering on cluster trees which we use to prune some of the statistically insignificant features of the empirical tree, yielding interpretable and parsimonious cluster trees. Finally, we illustrate the proposed methods on a variety of synthetic examples and furthermore demonstrate their utility in the analysis of a Graft-versus-Host Disease (GvHD) data set.

MEApr 14, 2016
Distribution-Free Predictive Inference For Regression

Jing Lei, Max G'Sell, Alessandro Rinaldo et al.

We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called {\it rank-one-out} conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, in order to adapt to heteroskedascity in the data. Finally, we propose a model-free notion of variable importance, called {\it leave-one-covariate-out} or LOCO inference. Accompanying this paper is an R package {\tt conformalInference} that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.

LGFeb 6, 2016
Classification accuracy as a proxy for two sample testing

Ilmun Kim, Aaditya Ramdas, Aarti Singh et al.

When data analysts train a classifier and check if its accuracy is significantly different from chance, they are implicitly performing a two-sample test. We investigate the statistical properties of this flexible approach in the high-dimensional setting. We prove two results that hold for all classifiers in any dimensions: if its true error remains $ε$-better than chance for some $ε>0$ as $d,n \to \infty$, then (a) the permutation-based test is consistent (has power approaching to one), (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent. To get a finer understanding of the rates of consistency, we study a specialized setting of distinguishing Gaussians with mean-difference $δ$ and common (known or unknown) covariance $Σ$, when $d/n \to c \in (0,\infty)$. We study variants of Fisher's linear discriminant analysis (LDA) such as "naive Bayes" in a nontrivial regime when $ε\to 0$ (the Bayes classifier has true accuracy approaching 1/2), and contrast their power with corresponding variants of Hotelling's test. Surprisingly, the expressions for their power match exactly in terms of $n,d,δ,Σ$, and the LDA approach is only worse by a constant factor, achieving an asymptotic relative efficiency (ARE) of $1/\sqrtπ$ for balanced samples. We also extend our results to high-dimensional elliptical distributions with finite kurtosis. Other results of independent interest include minimax lower bounds, and the optimality of Hotelling's test when $d=o(n)$. Simulation results validate our theory, and we present practical takeaway messages along with natural open problems.

MLJan 23, 2016
Minimax Lower Bounds for Linear Independence Testing

Aaditya Ramdas, David Isenberg, Aarti Singh et al.

Linear independence testing is a fundamental information-theoretic and statistical problem that can be posed as follows: given $n$ points $\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in \mathbb{R}^q$ or not. We give minimax lower bound for this problem (when $p+q,n \to \infty$, $(p+q)/n \leq κ< \infty$, without sparsity assumptions). In summary, our results imply that $n$ must be at least as large as $\sqrt {pq}/\|Σ_{XY}\|_F^2$ for any procedure (test) to have non-trivial power, where $Σ_{XY}$ is the cross-covariance matrix of $X,Y$. We also provide some evidence that the lower bound is tight, by connections to two-sample testing and regression in specific settings.

MEOct 8, 2015
Statistical Analysis of Persistence Intensity Functions

Yen-Chi Chen, Daren Wang, Alessandro Rinaldo et al.

Persistence diagrams are two-dimensional plots that summarize the topological features of functions and are an important part of topological data analysis. A problem that has received much attention is how deal with sets of persistence diagrams. How do we summarize them, average them or cluster them? One approach -- the persistence intensity function -- was introduced informally by Edelsbrunner, Ivanov, and Karasev (2012). Here we provide a modification and formalization of this approach. Using the persistence intensity function, we can visualize multiple diagrams, perform clustering and conduct two-sample tests.

STAug 4, 2015
Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos et al.

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics - high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties include (a) eED and gMMD have asymptotically equal power; furthermore they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds, (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic, (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.

STJun 29, 2015
Statistical Inference using the Morse-Smale Complex

Yen-Chi Chen, Christopher R. Genovese, Larry Wasserman

The Morse-Smale complex of a function $f$ decomposes the sample space into cells where $f$ is increasing or decreasing. When applied to nonparametric density estimation and regression, it provides a way to represent, visualize, and compare multivariate functions. In this paper, we present some statistical results on estimating Morse-Smale complexes. This allows us to derive new results for two existing methods: mode clustering and Morse-Smale regression. We also develop two new methods based on the Morse-Smale complex: a visualization technique for multivariate functions and a two-sample, multivariate hypothesis test.

MEJun 7, 2015
Optimal Ridge Detection using Coverage Risk

Yen-Chi Chen, Christopher R. Genovese, Shirley Ho et al.

We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recover the underlying density structure.

MLMay 15, 2015
An Analysis of Active Learning With Uniform Feature Noise

Aaditya Ramdas, Barnabas Poczos, Aarti Singh et al.

In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as "errors in variables" and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $σ$. Our simple but revealing setting is that of one-dimensional binary classification setting where the goal is to learn a threshold (point where the probability of a $+$ label crosses half). We deal with regression functions that are antisymmetric in a region of size $σ$ around the threshold and also satisfy Tsybakov's margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $σ$ is smaller than the minimiax active/passive noiseless error derived in \cite{CN07}, then noise has no effect on the rates and one achieves the same noiseless rates. For larger $σ$, the \textit{unflattening} of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise \textit{appears} to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.

STMay 3, 2015
Risk Bounds For Mode Clustering

Martin Azizyan, Yen-Chi Chen, Aarti Singh et al.

Density mode clustering is a nonparametric clustering method. The clusters are the basins of attraction of the modes of a density estimator. We study the risk of mode-based clustering. We show that the clustering risk over the cluster cores --- the regions where the density is high --- is very small even in high dimensions. And under a low noise condition, the overall cluster risk is small even beyond the cores, in high dimensions.

MEDec 4, 2014
Nonparametric modal regression

Yen-Chi Chen, Christopher R. Genovese, Ryan J. Tibshirani et al.

Modal regression estimates the local modes of the distribution of $Y$ given $X=x$, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.

STNov 23, 2014
On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos et al.

Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textit{general} alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textit{mean-shift} alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones.

MLNov 17, 2014
Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations

Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos et al.

We propose and analyze estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.

MLOct 30, 2014
On Estimating $L_2^2$ Divergence

Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos et al.

We give a comprehensive theoretical characterization of a nonparametric estimator for the $L_2^2$ divergence between two continuous distributions. We first bound the rate of convergence of our estimator, showing that it is $\sqrt{n}$-consistent provided the densities are sufficiently smooth. In this smooth regime, we then show that our estimator is asymptotically normal, construct asymptotic confidence intervals, and establish a Berry-Esséen style inequality characterizing the rate of convergence to normality. We also show that this estimator is minimax optimal.

MEAug 6, 2014
The functional mean-shift algorithm for mode hunting and clustering in infinite dimensions

Mattia Ciollaro, Christopher Genovese, Jing Lei et al.

We introduce the functional mean-shift algorithm, an iterative algorithm for estimating the local modes of a surrogate density from functional data. We show that the algorithm can be used for cluster analysis of functional data. We propose a test based on the bootstrap for the significance of the estimated local modes of the surrogate density. We present two applications of our methodology. In the first application, we demonstrate how the functional mean-shift algorithm can be used to perform spike sorting, i.e. cluster neural activity curves. In the second application, we use the functional mean-shift algorithm to distinguish between original and fake signatures.

GAJun 29, 2014
Estimating the distribution of Galaxy Morphologies on a continuous space

Giuseppe Vinci, Peter Freeman, Jeffrey Newman et al.

The incredible variety of galaxy shapes cannot be summarized by human defined discrete classes of shapes without causing a possibly large loss of information. Dictionary learning and sparse coding allow us to reduce the high dimensional space of shapes into a manageable low dimensional continuous vector space. Statistical inference can be done in the reduced space via probability distribution estimation and manifold estimation.

STJun 9, 2014
Feature Selection For High-Dimensional Clustering

Larry Wasserman, Martin Azizyan, Aarti Singh

We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features, and cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds on mode based clustering.

STJun 9, 2014
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

Martin Azizyan, Aarti Singh, Larry Wasserman

We consider the problem of clustering data points in high dimensions, i.e. when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Additionally, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we do not require spherical clusters nor necessitate mean separation along relevant dimensions.

MLJun 9, 2014
On the Decreasing Power of Kernel and Distance based Nonparametric Hypothesis Tests in High Dimensions

Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos et al.

This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing. There is a belief that two recently proposed solutions, based on kernels and distances between pairs of points, behave well in high-dimensional settings. We identify different sources of misconception that give rise to the above belief. Specifically, we differentiate the hardness of estimation of test statistics from the hardness of testing whether these statistics are zero or not, and explicitly discuss a notion of "fair" alternative hypotheses for these problems as dimension increases. We then demonstrate that the power of these tests actually drops polynomially with increasing dimension against fair alternatives. We end with some theoretical insights and shed light on the \textit{median heuristic} for kernel bandwidth selection. Our work advances the current understanding of the power of modern nonparametric hypothesis tests in high dimensions.

MEJun 6, 2014
A Comprehensive Approach to Mode Clustering

Yen-Chi Chen, Christopher R. Genovese, Larry Wasserman

Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of a density estimator's modes. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples

MLFeb 12, 2014
Nonparametric Estimation of Renyi Divergence and Friends

Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos et al.

We consider nonparametric estimation of $L_2$, Renyi-$α$ and Tsallis-$α$ divergences between continuous distributions. Our approach is to construct estimators for particular integral functionals of two densities and translate them into divergence estimators. For the integral functionals, our estimators are based on corrections of a preliminary plug-in estimator. We show that these estimators achieve the parametric convergence rate of $n^{-1/2}$ when the densities' smoothness, $s$, are both at least $d/4$ where $d$ is the dimension. We also derive minimax lower bounds for this problem which confirm that $s > d/4$ is necessary to achieve the $n^{-1/2}$ rate of convergence. We validate our theoretical guarantees with a number of simulations.

MEDec 29, 2013
Nonparametric Inference For Density Modes

Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli et al.

We derive nonparametric confidence intervals for the eigenvalues of the Hessian at modes of a density estimate. This provides information about the strength and shape of modes and can also be used as a significance test. We use a data-splitting approach in which potential modes are identified using the first half of the data and inference is done with the second half of the data. To get valid confidence sets for the eigenvalues, we use a bootstrap based on an elementary-symmetric-polynomial (ESP) transformation. This leads to valid bootstrap confidence sets regardless of any multiplicities in the eigenvalues. We also suggest a new method for bandwidth selection, namely, choosing the bandwidth to maximize the number of significant modes. We show by example that this method works well. Even when the true distribution is singular, and hence does not have a density, (in which case cross validation chooses a zero bandwidth), our method chooses a reasonable bandwidth.

STSep 26, 2013
Estimating Undirected Graphs Under Weak Assumptions

Larry Wasserman, Mladen Kolar, Alessandro Rinaldo

We consider the problem of providing nonparametric confidence guarantees for undirected graphs under weak assumptions. In particular, we do not assume sparsity, incoherence or Normality. We allow the dimension $D$ to increase with the sample size $n$. First, we prove lower bounds that show that if we want accurate inferences with low assumptions then there are limitations on the dimension as a function of sample size. When the dimension increases slowly with sample size, we show that methods based on Normal approximations and on the bootstrap lead to valid inferences and we provide Berry-Esseen bounds on the accuracy of the Normal approximation. When the dimension is large relative to sample size, accurate inferences for graphs under low assumptions are not possible. Instead we propose to estimate something less demanding than the entire partial correlation graph. In particular, we consider: cluster graphs, restricted partial correlation graphs and correlation graphs.

MLJul 29, 2013
Tight Lower Bounds for Homology Inference

Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh et al.

The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and sometimes the dimension of the manifold. In earlier work, we have considered the statistical problem of estimating the homology of a manifold from noiseless samples and from noisy samples under several different noise models. We derived upper and lower bounds on the minimax risk for this problem. In this note we revisit the noiseless case. In previous work we used Le Cam's lemma to establish a lower bound that differed from the upper bound of Niyogi, Smale and Weinberger by a polynomial factor in the condition number. In this note we use a different construction based on the direct analysis of the likelihood ratio test to show that the upper bound of Niyogi, Smale and Weinberger is in fact tight, thus establishing rate optimal asymptotic minimax bounds for the problem. The techniques we use here extend in a straightforward way to the noisy settings considered in our earlier work.

MLJul 24, 2013
Cluster Trees on Manifolds

Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo et al.

In this paper we investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\mathbb{R}^D$. We analyze a modified version of a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. The main results of this paper show that under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also show that similar (albeit non-algorithmic) results can be obtained for kernel density estimators. We sketch a construction of a sample complexity lower bound instance for a natural class of manifold oblivious clustering algorithms. We further briefly consider the known manifold case and show that in this case a spatially adaptive algorithm achieves better rates.

MLJun 9, 2013
Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Martin Azizyan, Aarti Singh, Larry Wasserman

While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering.

STMar 28, 2013
Confidence sets for persistence diagrams

Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo et al.

Persistent homology is a method for probing topological properties of point clouds and functions. The method involves tracking the birth and death of topological features (2000) as one varies a tuning parameter. Features with short lifetimes are informally considered to be "topological noise," and those with a long lifetime are considered to be "topological signal." In this paper, we bring some statistical ideas to persistent homology. In particular, we derive confidence sets that allow us to separate topological signal from topological noise.

MLFeb 26, 2013
A Conformal Prediction Approach to Explore Functional Data

Jing Lei, Alessandro Rinaldo, Larry Wasserman

This paper applies conformal prediction techniques to compute simultaneous prediction bands and clustering trees for functional data. These tools can be used to detect outliers and clusters. Both our prediction bands and clustering trees provide prediction sets for the underlying stochastic process with a guaranteed finite sample behavior, under no distributional assumptions. The prediction sets are also informative in that they correspond to the high density region of the underlying process. While ordinary conformal prediction has high computational cost for functional data, we use the inductive conformal predictor, together with several novel choices of conformity scores, to simplify the computation. Our methods are illustrated on some real data examples.

MLFeb 1, 2013
Distribution-Free Distribution Regression

Barnabas Poczos, Alessandro Rinaldo, Aarti Singh et al.

`Distribution regression' refers to the situation where a response Y depends on a covariate P where P is a probability distribution. The model is Y=f(P) + mu where f is an unknown regression function and mu is a random error. Typically, we do not observe P directly, but rather, we observe a sample from P. In this paper we develop theory and methods for distribution-free versions of distribution regression. This means that we do not make distributional assumptions about the error term mu and covariate P. We prove that when the effective dimension is small enough (as measured by the doubling dimension), then the excess prediction risk converges to zero with a polynomial rate.

MEJun 27, 2012
The Nonparanormal SKEPTIC

Han Liu, Fang Han, Ming Yuan et al.

We propose a semiparametric approach, named nonparanormal skeptic, for estimating high dimensional undirected graphical models. In terms of modeling, we consider the nonparanormal family proposed by Liu et al (2009). In terms of estimation, we exploit nonparametric rank-based correlation coefficient estimators including the Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This result suggests that the nonparanormal graphical models are a safe replacement of the Gaussian graphical models, even when the data are Gaussian.