Metric Similarity and Manifold Learning of Circular Dichroism Spectra of Proteins
This work is aimed at bioinformatics researchers interested in protein structure analysis; its contribution is incremental, since it applies existing methods to a specific dataset.
The study analyzes circular dichroism spectra of globular proteins using machine learning. It finds that the optimal transport-based Wasserstein distance is robust to noise and consistent with the Euclidean and Manhattan metrics, and that the t-SNE embedding separates proteins into distinct groups according to their secondary structure composition.
We present a machine learning analysis of circular dichroism spectra of globular proteins from the SP175 database, using the optimal transport-based $1$-Wasserstein distance $\mathcal{W}_1$ and the manifold learning algorithm $t$-SNE. Our results demonstrate that $\mathcal{W}_1$ is consistent with both the Euclidean and Manhattan metrics while exhibiting robustness to noise. In addition, $t$-SNE uncovers meaningful structure in the high-dimensional data. The clustering in the $t$-SNE embedding is primarily determined by proteins with distinct secondary structure compositions: one cluster predominantly contains $\beta$-rich proteins, while the other consists mainly of proteins with mixed $\alpha/\beta$ and $\alpha$-helical content.
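To make the comparison between $\mathcal{W}_1$ and the pointwise metrics concrete, the following is a minimal sketch rather than the procedure used in this work. It assumes that two spectra are sampled on the same uniform wavelength grid and that each is turned into a probability distribution by shifting it to be non-negative and normalising it to unit mass; this preprocessing is an assumption, as are the function names `to_distribution`, `wasserstein_1`, `manhattan`, and `euclidean`. In one dimension, $\mathcal{W}_1$ equals the integral of the absolute difference between the two cumulative distribution functions, which is what the sketch evaluates.

```python
from itertools import accumulate
import math


def to_distribution(spectrum):
    """Shift a spectrum to be non-negative and normalise it to unit mass.

    Assumption for illustration only: the source does not state how
    signed CD spectra are mapped to distributions.
    """
    lowest = min(spectrum)
    shifted = [value - lowest for value in spectrum]
    total = sum(shifted)
    if total == 0:
        raise ValueError("a flat spectrum cannot be normalised")
    return [value / total for value in shifted]


def wasserstein_1(p, q, step=1.0):
    """1-Wasserstein distance between two histograms on the same uniform grid.

    Uses the 1D identity W_1 = integral of |CDF_p - CDF_q|, approximated
    by a sum over grid points with spacing `step`.
    """
    if len(p) != len(q):
        raise ValueError("both histograms must share the same grid")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return step * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def manhattan(p, q):
    """Manhattan (L1) distance between two equally sampled vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Euclidean (L2) distance between two equally sampled vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy "spectra": one peak, displaced by one or two grid points.
peak_a = to_distribution([0.0, 1.0, 0.0, 0.0])
peak_b = to_distribution([0.0, 0.0, 1.0, 0.0])
peak_c = to_distribution([0.0, 0.0, 0.0, 1.0])

w1_ab = wasserstein_1(peak_a, peak_b)  # peak moved by one grid point
w1_ac = wasserstein_1(peak_a, peak_c)  # peak moved by two grid points
l1_ab, l1_ac = manhattan(peak_a, peak_b), manhattan(peak_a, peak_c)
l2_ab, l2_ac = euclidean(peak_a, peak_b), euclidean(peak_a, peak_c)

print("W1:", w1_ab, w1_ac)
print("L1:", l1_ab, l1_ac)
print("L2:", l2_ab, l2_ac)
```

In this toy case the Manhattan and Euclidean distances are identical for both displacements, because the peaks never overlap, whereas $\mathcal{W}_1$ grows with the size of the shift along the wavelength axis. Real spectra overlap and carry noise, so the example only illustrates the qualitative difference between the metrics and says nothing about the SP175 results.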