Beyond Explained Variance: A Cautionary Tale of PCA

arXiv:2605.13520

AI Analysis

This work warns practitioners in data visualization and manifold learning about the limitations of PCA for nonlinear data, though the findings are incremental because the issue is already known.

The paper shows that PCA can produce misleading visualizations for nonlinear low-dimensional manifolds, as demonstrated on a fossil teeth dataset where PCA suggests clustering but t-SNE and persistent homology reveal a ring-like structure with intrinsic dimension one.

We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 < 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on t-SNE and persistent homology.
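The arcsine claim in the abstract can be checked numerically. The following is a minimal sketch on synthetic data only, not the authors' code or the fossil teeth dataset. It assumes the cosine distance between two points is 1 - cos(angle between them), so that for points drawn uniformly from the unit circle the distance has CDF F(d) = arccos(1 - d) / pi on [0, 2]. The function names, the sample size, and the seed are hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_cosine_distances(points):
    """Cosine distances 1 - cos(angle) for all unordered pairs of rows."""
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    sims = np.clip(unit @ unit.T, -1.0, 1.0)
    i, j = np.triu_indices(len(points), k=1)  # each pair once, no self-pairs
    return 1.0 - sims[i, j]


def arcsine_cdf(d):
    """CDF of the arcsine distribution on [0, 2]: arccos(1 - d) / pi."""
    return np.arccos(np.clip(1.0 - d, -1.0, 1.0)) / np.pi


# Assumed setup: 400 synthetic points, uniform on the unit circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=400)
points = np.column_stack([np.cos(theta), np.sin(theta)])

dists = pairwise_cosine_distances(points)

# Compare the empirical CDF with the arcsine CDF on a coarse grid.
grid = np.linspace(0.0, 2.0, 41)
empirical = np.array([(dists <= g).mean() for g in grid])
max_gap = float(np.max(np.abs(empirical - arcsine_cdf(grid))))

# A U-shaped histogram has more mass in the end bins than in the middle.
counts, _ = np.histogram(dists, bins=10, range=(0.0, 2.0))

print("max CDF gap:", max_gap)
print("histogram counts:", counts.tolist())
```

A small CDF gap and a histogram whose end bins dominate its middle bins are what the circle model predicts. A similar comparison against real data would only indicate qualitative agreement, as the abstract states.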
