MLAug 19, 2024
Robust spectral clustering with rank statisticsJoshua Cape, Xianshi Yu, Jonquil Z. Liao
This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile. Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit. Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure.
STJul 26, 2025
Extreme value theory for singular subspace estimation in the matrix denoising modelJunhyung Chang, Joshua Cape
This paper studies fine-grained singular subspace estimation in the matrix denoising model where a deterministic low-rank signal matrix is additively perturbed by a stochastic matrix of Gaussian noise. We establish that the maximum Euclidean row norm (i.e., the two-to-infinity norm) of the aligned difference between the leading sample and population singular vectors approaches the Gumbel distribution in the large-matrix limit, under suitable signal-to-noise conditions and after appropriate centering and scaling. We apply our novel asymptotic distributional theory to test hypotheses of low-rank signal structure encoded in the leading singular vectors and their corresponding principal subspace. We provide de-biased estimators for the corresponding nuisance signal singular values and show that our proposed plug-in test statistic has desirable properties. Notably, compared to using the Frobenius norm subspace distance, our test statistic based on the two-to-infinity norm has higher power to detect structured alternatives that differ from the null in only a few matrix entries or rows. Our main results are obtained by a novel synthesis of and technical analysis involving entrywise matrix perturbation analysis, extreme value theory, saddle point approximation methods, and random matrix theory. Our contributions complement the existing literature for matrix denoising focused on minimaxity, mean squared error analysis, unitarily invariant distances between subspaces, component-wise asymptotic distributional theory, and row-wise uniform error bounds. Numerical simulations illustrate our main results and demonstrate the robustness properties of our testing procedure to non-Gaussian noise distributions.
MEFeb 8, 2022
Spectral embedding and the latent geometry of multipartite networksAlexander Modell, Ian Gallagher, Joshua Cape et al.
Spectral embedding finds vector representations of the nodes of a network, based on the eigenvectors of a properly constructed matrix, and has found applications throughout science and technology. Many networks are multipartite, meaning that they contain nodes of fundamentally different types, e.g. drugs, diseases and proteins, and edges are only observed between nodes of different types. When the network is multipartite, this paper demonstrates that the node representations obtained via spectral embedding lie near type-specific low-dimensional subspaces of a higher-dimensional ambient space. For this reason we propose a follow-on step after spectral embedding, to recover node representations in their intrinsic rather than ambient dimension, proving uniform consistency under a low-rank, inhomogeneous random graph model. We demonstrate the performance of our procedure on a large 6-partite biomedical network relevant for drug discovery.
SIJul 4, 2020
On spectral algorithms for community detection in stochastic blockmodel graphs with vertex covariatesCong Mu, Angelo Mele, Lingxin Hao et al.
In network inference applications, it is often desirable to detect community structure, namely to cluster vertices into groups, or blocks, according to some measure of similarity. Beyond mere adjacency matrices, many real networks also involve vertex covariates that carry key information about underlying block structure in graphs. To assess the effects of such covariates on block recovery, we present a comparative analysis of two model-based spectral algorithms for clustering vertices in stochastic blockmodel graphs with vertex covariates. The first algorithm uses only the adjacency matrix, and directly estimates the block assignments. The second algorithm incorporates both the adjacency matrix and the vertex covariates into the estimation of block assignments, and moreover quantifies the explicit impact of the vertex covariates on the resulting estimate of the block assignments. We employ Chernoff information to analytically compare the algorithms' performance and derive the information-theoretic Chernoff ratio for certain models of interest. Analytic results and simulations suggest that the second algorithm is often preferred: we can often better estimate the induced block assignments by first estimating the effect of vertex covariates. In addition, real data examples also indicate that the second algorithm has the advantages of revealing underlying block structure and taking observed vertex heterogeneity into account in real applications. Our findings emphasize the importance of distinguishing between observed and unobserved factors that can affect block structure in graphs.
MEAug 18, 2019
Spectral inference for large Stochastic Blockmodels with nodal covariatesAngelo Mele, Lingxin Hao, Joshua Cape et al.
In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. To this end, we develop spectral estimators for both unobserved blocks and the effect of covariates in stochastic blockmodels. On the theoretical side, we establish asymptotic normality of our estimators for the subsequent purpose of performing inference. On the applied side, we show that computing our estimator is much faster than standard variational expectation--maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role and campus-residence, while allowing us to discover unobserved communities. The results in this paper provide a foundation for spectral estimation of the effect of observed covariates as well as unobserved latent community structure on the probability of link formation in networks.
MLAug 23, 2018
On a 'Two Truths' Phenomenon in Spectral Graph ClusteringCarey E. Priebe, Youngser Park, Joshua T. Vogelstein et al.
Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical results provide new understanding of the problem and solutions, and lead us to a 'Two Truths' LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome data set: the different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core-periphery structure.
MLSep 16, 2017
A statistical interpretation of spectral embedding: the generalised random dot product graphPatrick Rubin-Delanchy, Joshua Cape, Minh Tang et al.
Spectral embedding is a procedure which can be used to obtain vector representations of the nodes of a graph. This paper proposes a generalisation of the latent position network model known as the random dot product graph, to allow interpretation of those vector representations as latent position estimates. The generalisation is needed to model heterophilic connectivity (e.g., `opposites attract') and to cope with negative eigenvalues more generally. We show that, whether the adjacency or normalised Laplacian matrix is used, spectral embedding produces uniformly consistent latent position estimates with asymptotically Gaussian error (up to identifiability). The standard and mixed membership stochastic block models are special cases in which the latent positions take only $K$ distinct vector values, representing communities, or live in the $(K-1)$-simplex with those vertices, respectively. Under the stochastic block model, our theory suggests spectral clustering using a Gaussian mixture model (rather than $K$-means) and, under mixed membership, fitting the minimum volume enclosing simplex, existing recommendations previously only supported under non-negative-definite assumptions. Empirical improvements in link prediction (over the random dot product graph), and the potential to uncover richer latent structure (than posited under the standard or mixed membership stochastic block models) are demonstrated in a cyber-security example.