11.0STAT-MECHMay 13
Beyond Explained Variance: A Cautionary Tale of PCAGionni Marchetti
We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 < 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on tt t-SNE and persistent homology.
STAT-MECHJan 27
Learning the Intrinsic Dimensionality of Fermi-Pasta-Ulam-Tsingou Trajectories: A Nonlinear Approach using a Deep Autoencoder ModelGionni Marchetti
We address the intrinsic dimensionality (ID) of high-dimensional trajectories, comprising $n_s = 4\,000\,000$ data points, of the Fermi-Pasta-Ulam-Tsingou (FPUT) $β$ model with $N = 32$ oscillators. To this end, a deep autoencoder (DAE) model is employed to infer the ID in the weakly nonlinear regime ($β\lesssim 1$). We find that the trajectories lie on a nonlinear manifold of dimension $m^{\ast} = 2$ embedded in a $64$-dimensional phase space. The DAE further reveals that this dimensionality increases to $m^{\ast} = 3$ at $β= 1.1$, coinciding with a symmetry breaking transition, in which additional energy modes with even wave numbers $k = 2, 4$ become excited. Finally, we discuss the limitations of the linear approach based on principal component analysis (PCA), which fails to capture the underlying structure of the data and therefore yields unreliable results in most cases.
SOC-PHApr 27, 2025
Metric Similarity and Manifold Learning of Circular Dichroism Spectra of ProteinsGionni Marchetti
We present a machine learning analysis of circular dichroism spectra of globular proteins from the SP175 database, using the optimal transport-based $1$-Wasserstein distance $\mathcal{W}_1$ (with order $p=1$) and the manifold learning algorithm $t$-SNE. Our results demonstrate that $\mathcal{W}_1$ is consistent with both Euclidean and Manhattan metrics while exhibiting robustness to noise. On the other hand, $t$-SNE uncovers meaningful structure in the high-dimensional data. The clustering in the $t$-SNE embedding is primarily determined by proteins with distinct secondary structure compositions: one cluster predominantly contains $β$-rich proteins, while the other consists mainly of proteins with mixed $α/β$ and $α$-helical content.
LGNov 4, 2024
Intrinsic Dimensionality of Fermi-Pasta-Ulam-Tsingou High-Dimensional Trajectories Through Manifold Learning: A Linear ApproachGionni Marchetti
A data-driven approach based on unsupervised machine learning is proposed to infer the intrinsic dimension $m^{\ast}$ of the high-dimensional trajectories of the Fermi-Pasta-Ulam-Tsingou (FPUT) model. Principal component analysis (PCA) is applied to trajectory data consisting of $n_s = 4,000,000$ datapoints, of the FPUT $β$ model with $N = 32$ coupled oscillators, revealing a critical relationship between $m^{\ast}$ and the model's nonlinear strength. By estimating the intrinsic dimension $m^{\ast}$ using multiple methods (participation ratio, Kaiser rule, and the Kneedle algorithm), it is found that $m^{\ast}$ increases with the model nonlinearity. Interestingly, in the weakly nonlinear regime, for trajectories initialized by exciting the first mode, the participation ratio estimates $m^{\ast} = 2, 3$, strongly suggesting that quasi-periodic motion on a low-dimensional Riemannian manifold underlies the characteristic energy recurrences observed in the FPUT model.