LG · May 1, 2013

Clustering Unclustered Data: Unsupervised Binary Labeling of Two Datasets Having Different Class Balances

Marthinus Christoffel du Plessis, Masashi Sugiyama

arXiv:1305.0103v1 · 21 citations

Originality: Incremental advance

AI Analysis

This paper addresses the problem of labeling unclustered data, a problem relevant to researchers and practitioners in unsupervised learning. The approach is novel, though its improvement over existing clustering methods is incremental.

The paper tackles the unsupervised labeling problem for binary classification by leveraging two unlabeled datasets with different class balances, showing that estimating the sign of the density difference between them provides a solution. It introduces a method to directly estimate this sign without full density estimation and demonstrates its effectiveness on toy problems and real-world datasets, outperforming several clustering methods.

We consider the unsupervised learning problem of assigning labels to unlabeled data. A naive approach is to use clustering methods, but this works well only when data is properly clustered and each cluster corresponds to an underlying class. In this paper, we first show that this unsupervised labeling problem in balanced binary cases can be solved if two unlabeled datasets having different class balances are available. More specifically, estimation of the sign of the difference between probability densities of two unlabeled datasets gives the solution. We then introduce a new method to directly estimate the sign of the density difference without density estimation. Finally, we demonstrate the usefulness of the proposed method against several clustering methods on various toy problems and real-world datasets.
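
The central claim of the abstract can be illustrated numerically. If the two unlabeled datasets are drawn from mixtures of the same two class densities with different mixing proportions, their density difference is proportional to the difference between the class densities, so its sign agrees with the balanced binary decision rule up to a global label flip. The sketch below is not the paper's method: the paper estimates the sign directly, without density estimation, whereas this example uses a naive two-step kernel density estimate purely for illustration. The 1-D Gaussian classes, the 0.7 and 0.3 class balances, the bandwidth, and all function names are assumptions made for the example.

```python
import numpy as np


def sample_mixture(n, prior_pos, rng):
    """Draw n 1-D points: class +1 ~ N(+1.5, 1), class -1 ~ N(-1.5, 1)."""
    y = np.where(rng.random(n) < prior_pos, 1, -1)
    x = rng.normal(loc=1.5 * y, scale=1.0)
    return x, y


def kde(train, query, bandwidth):
    """Gaussian kernel density estimate of `train`, evaluated at `query`."""
    z = (query[:, None] - train[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))


def sign_of_density_difference(x_a, x_b, query, bandwidth=0.4):
    """Plug-in estimate of sign(p_a(x) - p_b(x)) at each query point.

    Illustrative stand-in only: it estimates both densities first,
    which is the step the paper's direct estimator avoids.
    """
    return np.sign(kde(x_a, query, bandwidth) - kde(x_b, query, bandwidth))


rng = np.random.default_rng(0)

# Two unlabeled datasets with different (assumed) class balances.
x_a, _ = sample_mixture(2000, 0.7, rng)
x_b, _ = sample_mixture(2000, 0.3, rng)

# Balanced test data; its labels are used only to score the result.
x_test, y_test = sample_mixture(1000, 0.5, rng)

pred = sign_of_density_difference(x_a, x_b, x_test)
accuracy = np.mean(pred == y_test)
print(f"accuracy of the sign-based labeling: {accuracy:.3f}")
```

Here the positive sign corresponds to class +1 only because the first dataset was generated with the larger share of positives. Without that knowledge, the labeling is determined only up to swapping the two class names.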
