ITAIDSLGSTJan 22, 2025

Guaranteed Recovery of Unambiguous Clusters

arXiv:2501.13093v3h-index: 7ISIT
Originality Highly original
AI Analysis

This work addresses the challenge of ambiguous cluster recovery in non-convex clustering for data analysis applications, representing an incremental advance with a novel method for a known bottleneck.

The paper tackles the problem of ambiguous clusterings in data with varying densities and separated high-density regions by proposing an information-theoretic characterization to define when a clustering is unambiguous and designing an algorithm that recovers it under such conditions. The algorithm, tested with modifications for overlapping clusters, shows improved performance on many datasets compared to existing methods, requiring little parameter selection.

Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the $K$-clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes