Semi-Supervised Clustering with Inaccurate Pairwise Annotations
This work addresses clustering in domains where accurate class labels are hard to obtain. The contribution is incremental: it extends existing semi-supervised clustering methods to account for inaccurate annotations.
The paper tackles the problem of clustering with inaccurate pairwise annotations by proposing a generative model that incorporates must-link and cannot-link relations, showing that accounting for relational information significantly improves clustering performance even with weak and inaccurate supervision.
Pairwise relational information is a useful way of providing partial supervision in domains where class labels are difficult to acquire. This work presents a clustering model that incorporates pairwise annotations in the form of must-link and cannot-link relations and accounts for possible annotation inaccuracies, which are common when experts provide pairwise supervision. We propose a generative model that assumes Gaussian-distributed data samples along with must-link and cannot-link relations generated by stochastic block models. We adopt a maximum-likelihood approach and demonstrate that, even when supervision is weak and inaccurate, accounting for relational information significantly improves clustering performance. Relational information also helps to detect meaningful groups in real-world datasets that do not fit the original data-distribution assumptions. Additionally, we extend the model to integrate prior knowledge of experts' accuracy and discuss circumstances in which the use of this knowledge is beneficial.
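To make the generative assumptions above concrete, the following is a minimal sampling sketch, not the authors' implementation. It assumes isotropic Gaussian clusters with a shared standard deviation, uniform cluster proportions, and two independent stochastic block models, one for must-link and one for cannot-link annotations. The function name `sample_annotated_data` and the parameters `p_ml`, `q_ml`, `p_cl`, and `q_cl` are hypothetical names introduced here for illustration: `p_ml` and `p_cl` are the probabilities of a correct annotation, while `q_ml` and `q_cl` are the probabilities of an inaccurate one.

```python
import numpy as np


def sample_annotated_data(n, means, sigma, p_ml, q_ml, p_cl, q_cl, seed=0):
    """Sample points and noisy pairwise annotations (illustrative assumptions).

    Assumed parameterization, not taken from the paper:
      p_ml: P(must-link   | same cluster)        correct annotation
      q_ml: P(must-link   | different clusters)  inaccurate annotation
      p_cl: P(cannot-link | different clusters)  correct annotation
      q_cl: P(cannot-link | same cluster)        inaccurate annotation
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k, d = means.shape

    # Latent cluster labels (uniform proportions) and Gaussian samples.
    z = rng.integers(0, k, size=n)
    x = means[z] + sigma * rng.standard_normal((n, d))

    # Block structure: edge probabilities depend only on cluster membership.
    same = z[:, None] == z[None, :]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)  # each pair once, no self-pairs

    must_link = (rng.random((n, n)) < np.where(same, p_ml, q_ml)) & upper
    cannot_link = (rng.random((n, n)) < np.where(same, q_cl, p_cl)) & upper

    # Symmetrize so the annotation matrices are undirected.
    # Because the two graphs are sampled independently, a pair may
    # receive both annotations; this is a simplification of the sketch.
    return x, z, must_link | must_link.T, cannot_link | cannot_link.T


x, z, ml, cl = sample_annotated_data(
    n=60, means=[[0.0, 0.0], [5.0, 5.0]], sigma=1.0,
    p_ml=0.2, q_ml=0.0, p_cl=0.2, q_cl=0.0,
)
```

With `q_ml` and `q_cl` set to zero, as in the usage line, every sampled annotation is consistent with the latent clusters; raising them simulates the inaccurate supervision the abstract describes. A maximum-likelihood fit would then estimate the cluster parameters and the block probabilities jointly from `x`, `ml`, and `cl`.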