LGMar 5, 2018

An Analysis of the t-SNE Algorithm for Data Visualization

arXiv:1803.01768v2210 citations
Originality Incremental advance
AI Analysis

This work offers theoretical foundations for a widely used visualization tool, addressing a problem for researchers and practitioners in data analysis, though it is incremental as it builds on existing analyses.

The paper provides a formal framework for data visualization and gives the first provable guarantees for the t-SNE algorithm, showing it correctly separates clusters under deterministic conditions and partially recovers cluster structure even when those conditions are not met.

A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization - finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the "ground-truth" clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in partially recovering cluster structure even when the above deterministic condition is not met.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes