Stop Misusing t-SNE and UMAP for Visual Analytics
This work matters to data scientists and analysts because it targets misleading interpretations in visual analytics. Its contribution is incremental, however, since it builds on existing concerns about the misuse of dimensionality reduction.
The paper tackles the widespread misuse of t-SNE and UMAP in visual analytics, where practitioners read inter-cluster relationships from projections that often do not faithfully reflect the original distances between clusters. Based on a review of 136 papers and interviews with researchers and dimensionality reduction experts, it finds that the misuse stems primarily from limited dimensionality reduction literacy among practitioners.
Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect the original distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. We investigate why this misuse occurs, and discuss methods to prevent it. To that end, we first review 136 papers to verify the prevalence of the misuse. We then interview researchers who have used dimensionality reduction (DR) to understand why such misuse occurs. Finally, we interview DR experts to examine why previous efforts failed to address the misuse. We find that the misuse of t-SNE and UMAP stems primarily from limited DR literacy among practitioners, and that existing attempts to address this issue have been ineffective. Based on these insights, we discuss potential paths forward, including the controversial but pragmatic option of automating the selection of optimal DR projections to prevent misleading analyses.
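The abstract's central claim, that inter-cluster distances in a projection can diverge from those in the original space, can be checked numerically before anyone interprets a plot. The sketch below is not the paper's method. It is a minimal illustration that assumes cluster labels are available, uses Euclidean distances between cluster centroids as a stand-in for inter-cluster relationships, and uses Spearman rank correlation as the agreement score. The function names (`centroid_distances`, `inter_cluster_faithfulness`) and the toy data are hypothetical. In practice the `low` coordinates would come from a t-SNE or UMAP run.

```python
import math
from itertools import combinations


def centroids(points, labels):
    """Group points by label and return {label: centroid}."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


def centroid_distances(points, labels):
    """Euclidean distance between every pair of cluster centroids."""
    cents = centroids(points, labels)
    return [math.dist(cents[a], cents[b]) for a, b in combinations(sorted(cents), 2)]


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # all distances tied: correlation is undefined
    return cov / math.sqrt(var_a * var_b)


def inter_cluster_faithfulness(high, low, labels):
    """Rank agreement between inter-cluster distances before and after projection."""
    return spearman(centroid_distances(high, labels), centroid_distances(low, labels))


# Toy data (hypothetical): three clusters whose centroids sit at x = 0, 1, 10.
labels = [0, 0, 1, 1, 2, 2]
high = [(-0.5, 0, 0), (0.5, 0, 0), (0.5, 0, 0), (1.5, 0, 0), (9.5, 0, 0), (10.5, 0, 0)]

# A projection that keeps the cluster layout, and one that swaps clusters 1 and 2.
faithful_low = [(-0.5, 0), (0.5, 0), (0.5, 0), (1.5, 0), (9.5, 0), (10.5, 0)]
distorted_low = [(-0.5, 0), (0.5, 0), (9.5, 0), (10.5, 0), (0.5, 0), (1.5, 0)]

print(inter_cluster_faithfulness(high, faithful_low, labels))   # 1.0
print(inter_cluster_faithfulness(high, distorted_low, labels))  # -1.0
```

A score near 1 suggests the projection preserves the ordering of inter-cluster distances. A low or negative score is a warning that the apparent closeness of clusters in the plot should not be interpreted. Centroid distance is only one possible proxy, and other choices may rank projections differently.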