LG | Dec 4, 2023

Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios

arXiv:2312.02095v33.82 citations | h-index: 2 | Fundamenta Informaticae

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical issue for machine learning practitioners dealing with positive unlabeled data, but it is incremental as it adapts existing methods to different sampling schemes.

The paper examines classifiers for positive unlabeled data that were designed for case-control sampling and argues that their performance may deteriorate significantly in the single-sample scenario. The differences are most pronounced when half or more of the positive observations are labeled, and the authors address them by changing the definition of the empirical risk.

In this paper we argue that the performance of classifiers based on Empirical Risk Minimization (ERM) for positive unlabeled data, which are designed for the case-control sampling scheme, may significantly deteriorate when they are applied in a single-sample scenario. We reveal why their behavior depends, in all but very specific cases, on the scenario. We also introduce a single-sample analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with the original proposal. We show that significant differences occur between them, especially when half or more of the positive observations are labeled. The opposite case, when an ERM minimizer designed for the case-control case is applied to single-sample data, is also considered, and similar conclusions are drawn. Taking the difference between the scenarios into account requires a single, but crucial, change in the definition of the empirical risk.
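To make the role of the sampling scheme concrete, here is a minimal sketch of how a non-negative empirical risk could differ between the two scenarios. It is not taken from the paper: the case-control form follows the commonly used non-negative PU risk, while the single-sample form is derived here under stated assumptions, namely that positives are labeled completely at random with label frequency `c = P(s=1 | y=1)` and that the class prior `prior = P(y=1)` is known. The function names and the sigmoid surrogate loss are illustrative choices, and the paper's own definition may differ.

```python
import numpy as np


def sigmoid_loss(margin):
    # Sigmoid surrogate loss l(z) = 1 / (1 + exp(z)); small for large margins.
    return 1.0 / (1.0 + np.exp(margin))


def _partial_risks(scores_p, scores_u):
    # scores_p: classifier scores on labeled positives
    # scores_u: classifier scores on unlabeled observations
    r_p_pos = sigmoid_loss(scores_p).mean()   # positives scored as positive
    r_p_neg = sigmoid_loss(-scores_p).mean()  # positives scored as negative
    r_u_neg = sigmoid_loss(-scores_u).mean()  # unlabeled scored as negative
    return r_p_pos, r_p_neg, r_u_neg


def nn_risk_case_control(scores_p, scores_u, prior):
    # Case-control assumption: the unlabeled sample is drawn from the whole
    # population, so (1 - prior) * R_neg = R_u - prior * R_p_neg.
    r_p_pos, r_p_neg, r_u_neg = _partial_risks(scores_p, scores_u)
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)


def nn_risk_single_sample(scores_p, scores_u, prior, c):
    # Single-sample assumption: one population sample is drawn and each
    # positive is labeled with probability c. The unlabeled part then holds
    # a fraction (1 - c) * prior / (1 - c * prior) of positives, which gives
    # (1 - prior) * R_neg = (1 - c * prior) * R_u - (1 - c) * prior * R_p_neg.
    r_p_pos, r_p_neg, r_u_neg = _partial_risks(scores_p, scores_u)
    neg_part = (1.0 - c * prior) * r_u_neg - (1.0 - c) * prior * r_p_neg
    return prior * r_p_pos + max(0.0, neg_part)


# Toy scores, purely illustrative.
scores_p = np.array([2.0, 1.0])
scores_u = np.array([1.0, -1.0, -2.0])

risk_cc = nn_risk_case_control(scores_p, scores_u, prior=0.5)
risk_ss = nn_risk_single_sample(scores_p, scores_u, prior=0.5, c=0.5)
risk_ss_c0 = nn_risk_single_sample(scores_p, scores_u, prior=0.5, c=0.0)
print(risk_cc, risk_ss, risk_ss_c0)
```

In this sketch the two risks coincide as `c` approaches 0 and drift apart as `c` grows, because only the weights on the unlabeled and positive terms change.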
