LGMLJun 8, 2016

Clustering with Same-Cluster Queries

arXiv:1606.02404v2101 citations
AI Analysis

This addresses the computational complexity challenge in clustering for data scientists by enabling efficient solutions to otherwise intractable problems through limited expert interaction.

The paper tackles the NP-hard problem of k-means clustering by introducing a semi-supervised active clustering framework where queries to an expert about cluster membership are allowed, showing that with O(k^2 log k + k log n) queries, a probabilistic polynomial-time algorithm can efficiently solve it under margin conditions.

We propose a framework for Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a center-based clustering with a notion of margin. We show that there is a trade off between computational complexity and query complexity; We prove that for the case of $k$-means clustering (i.e., when the expert conforms to a solution of $k$-means), having access to relatively few such queries allows efficient solutions to otherwise NP hard problems. In particular, we provide a probabilistic polynomial-time (BPP) algorithm for clustering in this setting that asks $O\big(k^2\log k + k\log n)$ same-cluster queries and runs with time complexity $O\big(kn\log n)$ (where $k$ is the number of clusters and $n$ is the number of instances). The algorithm succeeds with high probability for data satisfying margin conditions under which, without queries, we show that the problem is NP hard. We also prove a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes