Semi-Supervised Active Clustering with Weak Oracles
This work addresses the challenge of reducing human effort in clustering tasks for domains requiring expert input, though it appears incremental by extending existing models with weak oracle assumptions.
The paper tackles the problem of semi-supervised active clustering by allowing 'not-sure' answers from weak oracles, proposing algorithms that achieve effective k-means clustering with high probability using a small query complexity, particularly when cluster margins are tight (e.g., γ≈1).
Semi-supervised active clustering (SSAC) utilizes the knowledge of a domain expert to cluster data points by interactively making pairwise "same-cluster" queries. However, it is impractical to ask human oracles to answer every pairwise query. In this paper, we study the influence of allowing "not-sure" answers from a weak oracle and propose algorithms to efficiently handle uncertainties. Different types of model assumptions are analyzed to cover realistic scenarios of oracle abstraction. In the first model, random-weak oracle, an oracle randomly abstains with a certain probability. We also proposed two distance-weak oracle models which simulate the case of getting confused based on the distance between two points in a pairwise query. For each weak oracle model, we show that a small query complexity is adequate for the effective $k$ means clustering with high probability. Sufficient conditions for the guarantee include a $γ$-margin property of the data, and an existence of a point close to each cluster center. Furthermore, we provide a sample complexity with a reduced effect of the cluster's margin and only a logarithmic dependency on the data dimension. Our results allow significantly less number of same-cluster queries if the margin of the clusters is tight, i.e. $γ\approx 1$. Experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties.