ML | LG | Dec 4, 2025

Informative missingness and its implications in semi-supervised learning

arXiv:2512.04392v111 citations | h-index: 9 | The Innovation Informatics
AI Analysis

This paper addresses a statistical challenge in semi-supervised learning that matters to both researchers and practitioners. It offers a framework that builds incrementally on likelihood-based methods by integrating a model for the missingness.

The paper tackles semi-supervised learning with informatively missing labels. It shows that modeling the missingness can yield a classifier with smaller expected error than one trained on fully labeled data, especially under moderate class overlap and sparse labels.

Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It combines information from labelled samples, whose acquisition is often costly or labour-intensive, with unlabelled data to enhance prediction performance. This defines an incomplete-data problem, which can be formulated statistically within the likelihood framework for finite mixture models and fitted using the expectation-maximisation (EM) algorithm. Ideally, one would prefer a completely labelled sample, since a labelled observation is expected to provide more information than an unlabelled one. However, when the mechanism governing label absence depends on the observed features, the class labels, or both, the missingness indicators themselves contain useful information. In certain situations, the information gained from modelling the missing-label mechanism can even outweigh the loss due to missing labels, yielding a classifier with a smaller expected error than one based on a completely labelled sample. This improvement arises particularly when class overlap is moderate, labelled data are sparse, and the missingness is informative. Modelling such informative missingness thus offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
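To make the incomplete-data formulation concrete, here is a minimal sketch of EM for a partially labelled sample. It is not the paper's method: it assumes a two-class univariate normal mixture with a common variance, and it ignores the missing-label mechanism entirely, so it corresponds to the baseline that modelling informative missingness is meant to improve on. The function names, the simulated data, and the 20% labelling rate are all hypothetical choices for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_partially_labelled_mixture(xs, ys, n_iter=100):
    """EM for a two-class univariate normal mixture with a common variance.

    xs: list of features; ys: list of labels (0 or 1), or None where missing.
    Assumes at least one labelled point in each class.
    Returns (pi1, mu0, mu1, var). The missingness mechanism is NOT modelled.
    """
    n = len(xs)
    lab0 = [x for x, y in zip(xs, ys) if y == 0]
    lab1 = [x for x, y in zip(xs, ys) if y == 1]

    # Initialise the parameters from the labelled points only.
    pi1 = len(lab1) / (len(lab0) + len(lab1))
    mu0 = sum(lab0) / len(lab0)
    mu1 = sum(lab1) / len(lab1)
    mean_all = sum(xs) / n
    var = sum((x - mean_all) ** 2 for x in xs) / n

    for _ in range(n_iter):
        # E-step: observed labels are kept as they are; each missing label
        # is replaced by the posterior probability of class 1 given x.
        tau = []
        for x, y in zip(xs, ys):
            if y is None:
                w0 = (1.0 - pi1) * normal_pdf(x, mu0, var)
                w1 = pi1 * normal_pdf(x, mu1, var)
                tau.append(w1 / (w0 + w1))
            else:
                tau.append(float(y))

        # M-step: weighted maximum-likelihood updates.
        n1 = sum(tau)
        n0 = n - n1
        pi1 = n1 / n
        mu0 = sum((1.0 - t) * x for t, x in zip(tau, xs)) / n0
        mu1 = sum(t * x for t, x in zip(tau, xs)) / n1
        var = sum((1.0 - t) * (x - mu0) ** 2 + t * (x - mu1) ** 2
                  for t, x in zip(tau, xs)) / n
    return pi1, mu0, mu1, var


# Simulated data (hypothetical): class means -1 and +1, unit variance,
# and labels kept completely at random with probability 0.2.
rng = random.Random(0)
true_labels = [rng.random() < 0.5 for _ in range(2000)]
xs = [rng.gauss(1.0 if z else -1.0, 1.0) for z in true_labels]
ys = [int(z) if rng.random() < 0.2 else None for z in true_labels]

pi1, mu0, mu1, var = fit_partially_labelled_mixture(xs, ys)
```

In this simulation the labels are dropped completely at random, so the missingness indicators carry no information about the parameters. Under informative missingness, the likelihood would gain an extra factor for the probability that each label is missing, and the M-step would in general no longer have the closed form used here.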
