LG, HC, ML · Oct 6, 2021

T-SNE Is Not Optimized to Reveal Clusters in Data

arXiv:2110.02573v1 · 5 citations
Originality: Synthesis-oriented
AI Analysis

The paper challenges the common belief that t-SNE reliably reveals clusters, which may affect researchers and practitioners who use it for cluster visualization. The contribution is incremental, since it critiques an existing method rather than proposing a new one.

The paper demonstrates that t-SNE can fail to reveal clusters even when the data is well-clusterable. It provides real-world counter-examples in which a better t-SNE objective value corresponds to a worse cluster embedding, and it finds that the assumptions behind the prior theoretical clustering guarantee are often violated on real-world data sets.

Cluster visualization is an essential task for nonlinear dimensionality reduction as a data analysis tool. It is often believed that Student t-Distributed Stochastic Neighbor Embedding (t-SNE) can show clusters for well-clusterable data, with a smaller Kullback-Leibler divergence corresponding to better quality. There is even a theoretical proof guaranteeing this property. However, we point out that this is not necessarily the case -- t-SNE may leave clustering patterns hidden despite strong signals present in the data. Extensive empirical evidence is provided to support our claim. First, several real-world counter-examples are presented, where t-SNE fails even if the input neighborhoods are well-clusterable. Tuning hyperparameters in t-SNE or using better optimization algorithms does not solve this issue, because a better value of the t-SNE learning objective can correspond to a worse cluster embedding. Second, we check the assumptions in the clustering guarantee of t-SNE and find they are often violated for real-world data sets.
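
The abstract refers to the Kullback-Leibler divergence as the t-SNE learning objective. As a rough illustration of what that quantity measures, the sketch below computes KL(P || Q) for a tiny hand-made data set, following the standard t-SNE formulation (Gaussian input similarities, Student t output similarities). It is not the paper's code: the function names and toy data are hypothetical, and it assumes a single global bandwidth `sigma` in place of the per-point bandwidths that t-SNE normally calibrates from a perplexity value.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(X, sigma=1.0):
    """Input similarities p_ij.

    Simplification (assumption): one global sigma, not the per-point
    bandwidths that t-SNE usually derives from the perplexity.
    """
    n = len(X)
    cond = []
    for i in range(n):
        row = [
            0.0 if i == j
            else math.exp(-squared_distance(X[i], X[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        cond.append([v / total for v in row])  # conditional p_{j|i}
    # Symmetrize so that all p_ij together sum to 1.
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def student_t_affinities(Y):
    """Output similarities q_ij from a Student t kernel (one degree of freedom)."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + squared_distance(Y[i], Y[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q):
    """KL(P || Q) summed over all pairs i != j; terms with p_ij = 0 contribute 0."""
    n = len(P)
    return sum(
        P[i][j] * math.log(P[i][j] / Q[i][j])
        for i in range(n) for j in range(n)
        if i != j and P[i][j] > 0.0
    )


# Toy data (hypothetical): two tight pairs of points in 2-D, embedded in 1-D.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
Y = [[0.0], [1.0], [10.0], [11.0]]

P = gaussian_affinities(X, sigma=1.0)
Q = student_t_affinities(Y)
kl = kl_divergence(P, Q)
print(f"KL(P || Q) = {kl:.4f}")
```

Evaluating `kl_divergence` for two candidate embeddings of the same data shows which one the objective prefers. The abstract's claim is that this preference does not have to agree with how clearly the clusters appear in the embedding.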

