LGMLMar 26, 2021

On UMAP's true loss function

arXiv:2103.14608v252 citations
Originality Incremental advance
AI Analysis

This work clarifies the underlying mechanics of UMAP, a widely used visualization tool, challenging previous understanding and potentially impacting researchers in data science and bioinformatics.

The paper investigates UMAP's optimization scheme, deriving its true loss function and showing it differs from the published version, revealing that UMAP reproduces similarities based on the shared k-nearest neighbor graph rather than its theoretical high-dimensional similarities, with evidence from toy and single-cell RNA sequencing data.

UMAP has supplanted t-SNE as state-of-the-art for visualizing high-dimensional datasets in many disciplines, but the reason for its success is not well understood. In this work, we investigate UMAP's sampling based optimization scheme in detail. We derive UMAP's effective loss function in closed form and find that it differs from the published one. As a consequence, we show that UMAP does not aim to reproduce its theoretically motivated high-dimensional UMAP similarities. Instead, it tries to reproduce similarities that only encode the shared $k$ nearest neighbor graph, thereby challenging the previous understanding of UMAP's effectiveness. Instead, we claim that the key to UMAP's success is its implicit balancing of attraction and repulsion resulting from negative sampling. This balancing in turn facilitates optimization via gradient descent. We corroborate our theoretical findings on toy and single cell RNA sequencing data.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes