t-SNE Exaggerates Clusters, Provably
This paper addresses a critical issue for users of t-SNE in data visualization and analysis: it reveals fundamental limitations of a widely used tool.
The paper proves that t-SNE visualizations can misrepresent the strength of input clusters and the extremity of outliers, making them unreliable for inferring these properties, and demonstrates these failures in practice.
Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. On the contrary, we prove that (1) the strength of the input clustering and (2) the extremity of outlier points cannot be reliably inferred from the t-SNE output. We also demonstrate the prevalence of these failure modes in practice.
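One way to build intuition for the outlier claim is to look at the input affinities used in the standard formulation of t-SNE. The sketch below is an illustration under stated assumptions, not the paper's proof: it computes one row of perplexity-calibrated Gaussian conditional affinities for a single outlier next to a small one-dimensional cluster. The function names, the toy data, and the perplexity value of 5 are all hypothetical choices made for this example. Because each row is normalized to sum to 1 and its bandwidth is tuned to the same perplexity, the outlier's row comes out nearly the same whether it sits 10 or 1000 units away.

```python
import math


def perplexity_of(p):
    """Perplexity exp(H) of a discrete distribution, with H in nats."""
    return math.exp(-sum(x * math.log(x) for x in p if x > 0.0))


def conditional_affinities(sq_dists, perplexity, iters=200):
    """One row of Gaussian conditional affinities p_{j|i}.

    The precision beta = 1 / (2 * sigma^2) is found by bisection so that
    the row's perplexity matches the requested value.
    """
    target = math.log(perplexity)   # target Shannon entropy
    d_min = min(sq_dists)           # subtracted for numerical stability
    lo, hi, beta = 0.0, None, 1.0
    p = []
    for _ in range(iters):
        w = [math.exp(-beta * (d - d_min)) for d in sq_dists]
        total = sum(w)
        p = [x / total for x in w]  # row always sums to 1
        entropy = -sum(x * math.log(x) for x in p if x > 0.0)
        if entropy > target:        # too flat: sharpen the kernel
            lo = beta
            beta = beta * 2.0 if hi is None else (lo + hi) / 2.0
        else:                       # too peaked: widen the kernel
            hi = beta
            beta = (lo + hi) / 2.0
    return p


# Toy data (an assumption of this sketch): ten cluster points on a line.
cluster = [0.1 * k for k in range(10)]


def outlier_row(distance, perplexity=5.0):
    """Affinity row of a single outlier placed at `distance` on the line."""
    sq_dists = [(distance - x) ** 2 for x in cluster]
    return conditional_affinities(sq_dists, perplexity)


near = outlier_row(10.0)    # mild outlier
far = outlier_row(1000.0)   # extreme outlier
max_gap = max(abs(a - b) for a, b in zip(near, far))

print("row sums:", sum(near), sum(far))
print("perplexities:", perplexity_of(near), perplexity_of(far))
print("largest entrywise difference:", max_gap)
```

In this toy setting the outlier's distance is largely erased before any embedding is computed, which is consistent with the claim that outlier extremity cannot be read off the output. It does not establish that claim in general.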