Clustering With Pairwise Relationships: A Generative Approach
This work addresses the challenge of improving clustering accuracy with limited user input in semi-supervised learning, though it appears incremental as it builds on existing constrained clustering methods.
The paper tackles constrained clustering by proposing a generative model that probabilistically incorporates user-defined pairwise relationships. The result is a principled approach that avoids ad hoc penalties and, for distributions in a standard form, yields closed-form parameter updates.
Semi-supervised learning (SSL) has become important in current data analysis applications, where the amount of unlabeled data is growing exponentially and user input remains limited by logistics and expense. Constrained clustering, a subclass of SSL, uses user input in the form of relationships between data points (e.g., pairs of data points belonging to the same class or to different classes) and can markedly improve the performance of unsupervised clustering by making the result reflect user-defined knowledge of those relationships. Existing algorithms incorporate such user input heuristically, as either hard constraints or soft penalties that are separate from any generative or statistical aspect of the clustering model; the resulting formulations are suboptimal and not sufficiently general. In this paper, we propose a principled, generative approach that probabilistically models the joint distribution given by user-defined pairwise relations, without ad hoc penalties. The proposed model accounts for general underlying distributions without assuming a specific form and relies on expectation-maximization for model fitting. For distributions in a standard form, the proposed approach yields a closed-form solution for the updated parameters.
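To make the idea of folding pairwise relations into the likelihood concrete, the following is a minimal sketch and not the model proposed in the paper. It uses a deliberately simpler, well-known simplification: points joined by a same-class ("must-link") relation are assumed to share one latent cluster label, so the E-step computes a single posterior per linked group, and the M-step keeps its usual closed form. Different-class relations are not handled. The function names `must_link_groups` and `em_gmm_must_link`, the spherical Gaussian components, and all parameters are assumptions made for illustration.

```python
import numpy as np

def must_link_groups(n, must_links):
    """Merge must-linked points into groups (transitive closure via union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in must_links:
        parent[find(a)] = find(b)
    roots = [find(i) for i in range(n)]
    _, group = np.unique(roots, return_inverse=True)
    return group  # group[i] is the group index of point i

def em_gmm_must_link(X, k, must_links, n_iter=50, seed=0):
    """EM for a spherical Gaussian mixture where each must-link group shares one label."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    group = must_link_groups(n, must_links)
    n_groups = group.max() + 1

    # Illustrative initialization: random data points as means, global variance.
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log density of every point under every component, shape (n, k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_pdf = -0.5 * (sq / var + d * np.log(2 * np.pi * var))

        # Points in a group share a label, so their log densities add up.
        log_joint = np.zeros((n_groups, k))
        np.add.at(log_joint, group, log_pdf)
        log_joint += np.log(pi)
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        resp_g = np.exp(log_joint)
        resp_g /= resp_g.sum(axis=1, keepdims=True)
        resp = resp_g[group]  # every point inherits its group's posterior

        # M-step: closed-form updates for the spherical Gaussian parameters.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq).sum(axis=0) / (d * nk) + 1e-6
        pi = resp_g.sum(axis=0) / n_groups  # the prior is over groups, not points

    return resp.argmax(axis=1), mu
```

A call such as `labels, means = em_gmm_must_link(X, k=2, must_links=[(0, 1), (5, 9)])` returns one hard label per row of `X`. Under this simplification, linked points always receive the same label because they share a posterior. A full generative treatment of pairwise relations, as the abstract describes, would model the relations probabilistically rather than enforcing them this way.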