Improved Estimation of Class Prior Probabilities through Unlabeled Data
This work addresses a practical challenge in classification tasks where labeled data is difficult to obtain. It offers a way to improve the estimation of class prior probabilities, with applications in fields such as medical diagnosis and fraud detection.
The paper tackles the problem of estimating class prior probabilities when labeled data is scarce or expensive, by leveraging unlabeled data to improve estimation accuracy. It shows that using unlabeled observations reduces asymptotic variance and extends the methodology to subclass probabilities.
Work in the classification literature has shown that, in computing a classification function, one need not know the class membership of all observations in the training set. The unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper shows that the same scheme can also be used to estimate class prior probabilities, which is useful in applications where determining class membership is difficult or expensive. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that using the unlabeled observations reduces asymptotic variance. The methodology is also extended to the estimation of subclass probabilities.
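The variance reduction described above can be illustrated with a small simulation. The sketch below is not the paper's estimator; it is a simple plug-in estimator chosen for illustration. It assumes a toy model with a binary label and a single binary feature that agrees with the label 90% of the time. The names `draw`, `labeled_only`, `with_unlabeled`, and `simulate`, and all sample sizes and probabilities, are hypothetical. The combined estimator takes P(Y=1 | X=x) from the labeled sample only, takes P(X=x) from the labeled and unlabeled features pooled together, and sums their product over x.

```python
import random


def draw(rng, p=0.3):
    """Draw one (x, y) pair from an assumed toy model."""
    y = 1 if rng.random() < p else 0
    # Assumption: the binary feature agrees with the label 90% of the time.
    x = y if rng.random() < 0.9 else 1 - y
    return x, y


def labeled_only(labeled):
    """Class prior estimate from labels alone: the sample proportion of y = 1."""
    return sum(y for _, y in labeled) / len(labeled)


def with_unlabeled(labeled, unlabeled_x):
    """Plug-in estimate: sum over x of P_hat(X = x) * P_hat(Y = 1 | X = x).

    P_hat(X = x) uses the features of all observations (labeled and unlabeled);
    P_hat(Y = 1 | X = x) uses the labeled observations only.
    """
    fallback = labeled_only(labeled)  # used if a feature value has no labeled cases
    all_x = [x for x, _ in labeled] + list(unlabeled_x)
    estimate = 0.0
    for value in (0, 1):
        p_x = all_x.count(value) / len(all_x)
        ys = [y for x, y in labeled if x == value]
        p_y_given_x = sum(ys) / len(ys) if ys else fallback
        estimate += p_x * p_y_given_x
    return estimate


def simulate(reps=500, n_labeled=50, n_unlabeled=1000, seed=0):
    """Return the Monte Carlo variances of the two estimators over `reps` runs."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(reps):
        labeled = [draw(rng) for _ in range(n_labeled)]
        unlabeled_x = [draw(rng)[0] for _ in range(n_unlabeled)]
        a.append(labeled_only(labeled))
        b.append(with_unlabeled(labeled, unlabeled_x))

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    return variance(a), variance(b)


if __name__ == "__main__":
    var_labeled, var_both = simulate()
    print("variance, labeled only:  ", var_labeled)
    print("variance, with unlabeled:", var_both)
```

In this toy setting the combined estimator should show a smaller Monte Carlo variance, because the unlabeled observations sharpen the estimate of the feature distribution while the labeled ones are needed only for the conditional class probabilities. The size of the gain depends on how informative the feature is about the class; a feature independent of the label would yield essentially no improvement.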