ML | LG | ST | ME | Mar 2, 2023

In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for Self-Training in Semi-Supervised Learning

arXiv:2303.01117v1 | h-index: 24
Originality: Incremental advance
AI Analysis

This work aims to improve generalization in semi-supervised learning by making pseudo-label selection more robust. It appears incremental, since it builds on existing self-training methods.

The paper tackles the selection of pseudo-labeled data in self-training for semi-supervised learning. It proposes a robust selection method that accounts for uncertainty arising from model selection, error accumulation, and covariate shift. The reported results suggest substantial accuracy gains, in particular from robustness to model choice.

Self-training is a simple yet effective method within semi-supervised learning. The idea is to iteratively enhance training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (PLS). In this paper, we aim at rendering PLS more robust towards the involved modeling assumptions. To this end, we propose to select pseudo-labeled data that maximize a multi-objective utility function. The latter is constructed to account for different sources of uncertainty, three of which we discuss in more detail: model selection, accumulation of errors and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian alpha-cut updating rule for credal sets. As a practical proof of concept, we spotlight the application of three of our robust extensions on simulated and real-world data. Results suggest that in particular robustness w.r.t. model choice can lead to substantial accuracy gains.
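The self-training loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's selection criterion: the nearest-centroid classifier, the feature-subset "candidate models", the weights, and the use of the maximum predicted class probability as a per-model utility are all assumptions made for illustration. The sketch only shows the general shape of the idea, namely scoring each unlabeled point under several candidate models and adding the point with the highest weighted-sum utility, so that the selection does not hinge on a single model choice.

```python
import math
import random

def fit_centroids(X, y, n_classes):
    """Hypothetical stand-in model: one mean vector per class."""
    cents = []
    for c in range(n_classes):
        rows = [x for x, lab in zip(X, y) if lab == c]
        cents.append([sum(col) / len(rows) for col in zip(*rows)])
    return cents

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def project(x, feats):
    """Restrict a point to the features used by one candidate model."""
    return [x[j] for j in feats]

def self_train(X_lab, y_lab, X_unl, feature_sets, weights, n_classes=2):
    """Add one pseudo-labeled point per iteration until none are left.

    Assumption: the utility of a point under one model is its maximum
    predicted class probability; utilities are combined by a weighted sum.
    """
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    total = sum(weights)
    weights = [w / total for w in weights]
    while X_unl:
        # Refit every candidate model on the current (enlarged) labeled set.
        models = [
            fit_centroids([project(x, f) for x in X_lab], y_lab, n_classes)
            for f in feature_sets
        ]
        best_i, best_u, best_label = None, -1.0, None
        for i, x in enumerate(X_unl):
            probs = [predict_proba(m, project(x, f))
                     for m, f in zip(models, feature_sets)]
            utility = sum(w * max(p) for w, p in zip(weights, probs))
            if utility > best_u:
                # Pseudo-label: class with the highest weighted average probability.
                mixed = [sum(w * p[c] for w, p in zip(weights, probs))
                         for c in range(n_classes)]
                best_i, best_u, best_label = i, utility, mixed.index(max(mixed))
        X_lab.append(X_unl.pop(best_i))
        y_lab.append(best_label)
    return X_lab, y_lab

def blob(rng, mu, n):
    """Draw n two-dimensional Gaussian points centred at (mu, mu)."""
    return [[rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)] for _ in range(n)]

# Toy data: two well-separated blobs, two labeled points per class.
rng = random.Random(0)
neg, pos = blob(rng, -5.0, 12), blob(rng, 5.0, 12)
X_out, y_out = self_train(
    neg[:2] + pos[:2], [0, 0, 1, 1], neg[2:] + pos[2:],
    feature_sets=[[0], [1], [0, 1]], weights=[1.0, 1.0, 1.0],
)
print(len(X_out), sum(y_out))
```

Changing `weights` shifts how much each candidate model influences the selection. A single nonzero weight reduces the sketch to ordinary self-training with one fixed model.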

Code Implementations: 1 repo
