Learning A Disentangling Representation For PU Learning
This work addresses the challenge of PU learning in high-dimensional settings, where existing methods degrade, offering a representation-learning solution for binary classification tasks.
The paper tackles the problem of binary classification with Positive and Unlabeled (PU) data by proposing a neural network-based representation learning method that projects unlabeled data into separable positive and negative clusters, demonstrating improved performance over state-of-the-art approaches in experiments on simulated data.
In this paper, we address the problem of learning a binary (positive vs. negative) classifier given Positive and Unlabeled data, commonly referred to as PU learning. Although rudimentary techniques like clustering, out-of-distribution detection, or positive density estimation can solve the problem in low-dimensional settings, their efficacy progressively deteriorates in higher dimensions as the data distribution grows more complex. We propose to learn a neural network-based data representation with a loss function designed to project the unlabeled data into two (positive and negative) clusters that can be easily identified using simple clustering techniques, effectively emulating the phenomenon observed in low-dimensional settings. We adopt a vector quantization technique for the learned representations to amplify the separation between the learned unlabeled data clusters. We conduct experiments on simulated PU data that demonstrate the improved performance of our proposed method compared to the current state-of-the-art approaches. We also provide some theoretical justification for our two-cluster approach and our algorithmic choices.
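To illustrate the low-dimensional phenomenon the abstract refers to, the following is a minimal sketch of the rudimentary clustering baseline, not the proposed method. It assumes 2-D Gaussian toy data, plain two-cluster k-means (Lloyd's algorithm), and a simple rule that names the cluster nearest the labeled positives as the positive one. The function names `two_means` and `pu_cluster_labels`, the initialization scheme, and all data parameters are hypothetical choices made for this example.

```python
import numpy as np


def two_means(unlabeled, init_centroids, n_iter=50):
    """Lloyd's algorithm with k=2, starting from the given 2 x d centroids."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Distance of every unlabeled point to both centroids: shape (n, 2).
        dists = np.linalg.norm(unlabeled[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):  # leave an empty cluster's centroid in place
                centroids[k] = unlabeled[assign == k].mean(axis=0)
    # Final assignment against the final centroids.
    dists = np.linalg.norm(unlabeled[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def pu_cluster_labels(positives, unlabeled):
    """Return 1 for unlabeled points in the cluster nearest the labeled positives."""
    pos_mean = positives.mean(axis=0)
    # Assumed initialization: one centroid at the positive mean, the other at
    # the unlabeled point farthest from it.
    farthest = unlabeled[np.linalg.norm(unlabeled - pos_mean, axis=1).argmax()]
    centroids, assign = two_means(unlabeled, np.stack([pos_mean, farthest]))
    pos_cluster = np.linalg.norm(centroids - pos_mean, axis=1).argmin()
    return (assign == pos_cluster).astype(int)


# Toy PU data (assumed): positives around (3, 3), negatives around (-3, -3).
rng = np.random.default_rng(0)
positives = rng.normal(loc=3.0, scale=1.0, size=(100, 2))
unlabeled = np.vstack([
    rng.normal(loc=3.0, scale=1.0, size=(200, 2)),   # hidden positives
    rng.normal(loc=-3.0, scale=1.0, size=(300, 2)),  # hidden negatives
])
truth = np.concatenate([np.ones(200, dtype=int), np.zeros(300, dtype=int)])

predicted = pu_cluster_labels(positives, unlabeled)
accuracy = (predicted == truth).mean()
print(f"accuracy on toy data: {accuracy:.3f}")
```

On well-separated low-dimensional data such as this, the two clusters are easy to recover. The abstract's point is that this stops holding as dimensionality grows, which motivates learning a representation in which the unlabeled data again forms two separable clusters.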