LG · ML · Feb 1, 2025

Improving realistic semi-supervised learning with doubly robust estimation

Khiem Pham, Charles Herrmann, Ramin Zabih

arXiv:2502.00279v1 · 4.1 · h-index: 7

Originality: Incremental advance

AI Analysis

This paper addresses biased pseudo-labeling in semi-supervised learning on long-tailed data, a setting common in real-world applications. It is an incremental improvement over existing methods.

The paper tackles the challenge of semi-supervised learning with long-tailed class distributions by proposing a doubly robust estimator to explicitly estimate the unlabeled class distribution, which improves the accuracy of pseudo-labeling methods in experiments.

A major challenge in Semi-Supervised Learning (SSL) is the limited information available about the class distribution in the unlabeled data. In many real-world applications this arises from the prevalence of long-tailed distributions, where the standard pseudo-label approach to SSL is biased towards the labeled class distribution and thus performs poorly on unlabeled data. Existing methods typically assume that the unlabeled class distribution is either known a priori, which is unrealistic in most situations, or estimate it on-the-fly using the pseudo-labels themselves. We propose to explicitly estimate the unlabeled class distribution, which is a finite-dimensional parameter, as an initial step, using a doubly robust estimator with a strong theoretical guarantee; this estimate can then be integrated into existing methods to pseudo-label the unlabeled data during training more accurately. Experimental results demonstrate that incorporating our techniques into common pseudo-labeling approaches improves their performance.
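
The abstract does not spell out the estimator. As a rough illustration only, the sketch below uses the textbook augmented inverse-probability-weighting (AIPW) form of a doubly robust estimator, treating missing labels as missing data. It estimates the class distribution over the pooled labeled and unlabeled samples. The function name, the `propensity` input (an estimated probability that each sample is labeled), and the pooled target are assumptions made for illustration; this is not the authors' method.

```python
def doubly_robust_class_distribution(probs, labels, propensity):
    """Generic AIPW-style estimate of the class distribution (illustrative only).

    probs:      per-sample predicted class probabilities, each a list of length K
    labels:     class index for labeled samples, None for unlabeled samples
    propensity: assumed estimate of P(sample is labeled), one value per sample
    """
    n = len(probs)
    k = len(probs[0])
    estimate = [0.0] * k
    for p, y, pi in zip(probs, labels, propensity):
        for c in range(k):
            # Model-based term: uses every sample, labeled or not.
            estimate[c] += p[c]
            if y is not None:
                # Correction term: inverse-propensity-weighted residual,
                # available only where the true label is observed.
                indicator = 1.0 if y == c else 0.0
                estimate[c] += (indicator - p[c]) / pi
    return [v / n for v in estimate]


# Usage: an uninformative classifier, two of four samples labeled as class 0.
probs = [[0.5, 0.5]] * 4
labels = [0, 0, None, None]
propensity = [0.5] * 4
dist = doubly_robust_class_distribution(probs, labels, propensity)
```

In this form the estimate is consistent if either the classifier probabilities or the propensities are correct, which is what "doubly robust" refers to. The entries always sum to one, but individual entries can fall outside [0, 1] in small samples, so a practical pipeline would likely clip and renormalize before using the result for pseudo-labeling.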

