LG AI QM ML
Aug 29, 2023

Navigating Perplexity: A linear relationship with the data set size in t-SNE embeddings

Martin Skrodzki, Nicolas F. Chaves-de-Plaza, Thomas Höllt, Elmar Eisemann, Klaus Hildebrandt

arXiv:2308.15513v2

Originality: Synthesis-oriented

AI Analysis

The paper gives users in data visualization and analysis a practical guideline for selecting perplexity more systematically, though the contribution is incremental, as it builds on existing t-SNE methods.

The paper tackles the problem of choosing the perplexity hyperparameter in t-SNE embeddings for high-dimensional data visualization, showing a linear relationship between perplexity and dataset size that maintains structural consistency across samples.

Widely used pipelines for analyzing high-dimensional data utilize two-dimensional visualizations. These are created, for instance, via t-distributed stochastic neighbor embedding (t-SNE). A crucial element of the t-SNE embedding procedure is the perplexity hyperparameter. That is because the embedding structure varies when perplexity is changed. A suitable perplexity choice depends on the data set and the intended usage for the embedding. Therefore, perplexity is often chosen based on heuristics, intuition, and prior experience. This paper uncovers a linear relationship between perplexity and the data set size. Namely, we show that embeddings remain structurally consistent across data set samples when perplexity is adjusted accordingly. Qualitative and quantitative experimental results support these findings. This informs the visualization process, guiding the user when choosing a perplexity value. Finally, we outline several applications for the visualization of high-dimensional data via t-SNE based on this linear relationship.
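As an illustration of how such a relationship might be applied when subsampling a data set, the following minimal sketch rescales a perplexity value in proportion to the number of points. The abstract does not state the slope or intercept of the relationship, so the purely proportional rule used here is an assumption, and the function name `scaled_perplexity` and its parameters are hypothetical.

```python
def scaled_perplexity(base_perplexity: float, base_size: int, sample_size: int) -> float:
    """Rescale a perplexity value in proportion to the data set size.

    Assumption (not taken from the abstract): the linear relationship passes
    through the origin, i.e. perplexity / n is held constant across samples.
    """
    if base_perplexity <= 0:
        raise ValueError("base_perplexity must be positive")
    if base_size < 2 or sample_size < 2:
        raise ValueError("data set sizes must be at least 2")

    # Keep the perplexity-to-size ratio of the reference embedding.
    perplexity = base_perplexity * sample_size / base_size

    # Perplexity acts as an effective neighbor count, so it is capped
    # below the number of points in the sample.
    return float(min(perplexity, sample_size - 1))


if __name__ == "__main__":
    # Hypothetical reference setting: perplexity 100 on 10,000 points.
    for n in (10_000, 5_000, 1_000):
        print(n, scaled_perplexity(100.0, 10_000, n))
```

The returned value could then be passed as the perplexity setting of a t-SNE implementation, for example the `perplexity` parameter of `sklearn.manifold.TSNE`, when embedding the subsample.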
