Anav Sood

h-index2
2papers

2 Papers

MEJul 24, 2023
A Statistical View of Column Subset Selection

Anav Sood, Trevor Hastie

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.

MEMar 3, 2025
Powerful rank verification for multivariate Gaussian data with any covariance structure

Anav Sood

Upon observing $n$-dimensional multivariate Gaussian data, when can we infer that the largest $K$ observations came from the largest $K$ means? When $K=1$ and the covariance is isotropic, \cite{Gutmann} argue that this inference is justified when the two-sided difference-of-means test comparing the largest and second largest observation rejects. Leveraging tools from selective inference, we provide a generalization of their procedure that applies for both any $K$ and any covariance structure. We show that our procedure draws the desired inference whenever the two-sided difference-of-means test comparing the pair of observations inside and outside the top $K$ with the smallest standardized difference rejects, and sometimes even when this test fails to reject. Using this insight, we argue that our procedure renders existing simultaneous inference approaches inadmissible when $n > 2$. When the observations are independent (with possibly unequal variances) or equicorrelated, our procedure corresponds exactly to running the two-sided difference-of-means test comparing the pair of observations inside and outside the top $K$ with the smallest standardized difference.