Semi-supervised cross-entropy clustering with information bottleneck constraint
This work addresses semi-supervised clustering for users who need efficient, robust methods when only part of the data is labeled. The contribution appears incremental, since it builds on the existing CEC and IB techniques.
The paper tackles semi-supervised clustering by proposing CEC-IB, which combines cross-entropy clustering with an information bottleneck constraint to balance data modeling accuracy, model simplicity, and consistency with partial labels. Experiments show that it is faster and more robust to noisy labels than Gaussian mixture models, automatically determines the optimal number of clusters, and performs well with incomplete side information.
In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades off three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB performs comparably to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can be successfully applied to discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.
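The three-way trade-off described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the objective, so the code assumes one plausible form, namely the standard CEC cost for a hard partition plus a weighted conditional entropy of the user labels given the clusters. The one-dimensional data, the function names (`cec_cost`, `label_inconsistency`, `combined_cost`), and the weight `beta` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def cec_cost(points, clusters):
    """CEC-style cost of a hard partition of 1-D data (illustrative).

    Cluster i contributes p_i * (-ln p_i + 0.5 * ln(2*pi*e*var_i)), where
    p_i is the fraction of points in the cluster and var_i its variance.
    The -ln p_i term charges for identifying the cluster (model simplicity);
    the Gaussian entropy term measures how well the cluster is modeled.
    """
    n = len(points)
    groups = defaultdict(list)
    for x, c in zip(points, clusters):
        groups[c].append(x)
    cost = 0.0
    for members in groups.values():
        p = len(members) / n
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        var = max(var, 1e-9)  # assumption: floor to avoid log(0) for degenerate clusters
        cost += p * (-math.log(p) + 0.5 * math.log(2 * math.pi * math.e * var))
    return cost


def label_inconsistency(clusters, labels):
    """Conditional entropy H(label | cluster), using labeled points only.

    `labels` holds a class label, or None where no label was provided.
    The value is zero when every cluster contains at most one labeled class.
    """
    pairs = [(c, z) for c, z in zip(clusters, labels) if z is not None]
    if not pairs:
        return 0.0
    m = len(pairs)
    per_cluster = Counter(c for c, _ in pairs)
    joint = Counter(pairs)
    return -sum((cnt / m) * math.log(cnt / per_cluster[c])
                for (c, _), cnt in joint.items())


def combined_cost(points, clusters, labels, beta=1.0):
    """Assumed objective: modeling cost plus beta times label inconsistency."""
    return cec_cost(points, clusters) + beta * label_inconsistency(clusters, labels)


# Toy data: two well-separated groups, four of six points labeled.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["a", "a", None, "b", "b", None]
two_clusters = [0, 0, 0, 1, 1, 1]
one_cluster = [0, 0, 0, 0, 0, 0]
```

Under these assumptions, the two-cluster partition scores lower than the single-cluster one: it fits the data more tightly and does not mix the labels "a" and "b". Raising `beta` increases the penalty on any partition that places differently labeled points in the same cluster.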