LG · Jun 3

Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness

arXiv:2606.05073
AI Analysis

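For practitioners handling missing data in tabular datasets, this work addresses an overlooked problem: some missing entries are meaningfully absent and should be preserved, while others are missing because of the observation process and should be imputed. It offers a principled way to decide which is which.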

The paper formalizes selective imputation, which separates meaningfully missing entries (intrinsically absent and semantically valid) from entries missing due to the observation process, and proposes Diff-Joint, a diffusion-based framework that jointly models tabular data and a latent missingness mask. Experiments on synthetic and real-world datasets show that it identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.

Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
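To make the alternation between conditional sampling and uncertainty-aware aggregation more concrete, here is a minimal toy sketch in Python. It is not the paper's Diff-Joint implementation: `toy_sampler` is a hypothetical stand-in for a conditional sampler (a hand-written rule, not a diffusion model), and the names `aggregate`, `selective_impute`, the 0.5 threshold, and the round and sample counts are all assumptions chosen for illustration. The sketch only shows the shape of the loop: draw several joint samples of (label, value) for each missing cell, then aggregate them into a keep-or-impute decision with an attached vote fraction.

```python
import random
import statistics

MISSING = None  # placeholder for an entry with no recorded value


def toy_sampler(row, col, rng):
    """Hypothetical stand-in for a conditional sampler (NOT a diffusion model).

    Returns (is_meaningful, value): a guess at whether the entry is
    meaningfully missing, plus a candidate value in case it is not.
    """
    observed = [v for v in row if v is not MISSING]
    center = statistics.fmean(observed) if observed else 0.0
    # Toy assumption: column 2 is usually structurally absent, others rarely.
    p_meaningful = 0.9 if col == 2 else 0.1
    return rng.random() < p_meaningful, rng.gauss(center, 1.0)


def aggregate(samples, threshold=0.5):
    """Turn repeated samples into a label, a value, and a vote fraction."""
    votes = [is_meaningful for is_meaningful, _ in samples]
    p = sum(votes) / len(votes)  # fraction voting "meaningfully missing"
    values = [v for is_meaningful, v in samples if not is_meaningful]
    if p >= threshold or not values:
        return {"keep_missing": True, "value": MISSING, "p_meaningful": p}
    return {"keep_missing": False, "value": statistics.fmean(values), "p_meaningful": p}


def selective_impute(table, n_samples=50, n_rounds=3, seed=0):
    """Alternate sampling and aggregation over all missing cells."""
    rng = random.Random(seed)
    current = [list(row) for row in table]
    missing = [(i, j) for i, row in enumerate(table)
               for j, v in enumerate(row) if v is MISSING]
    decisions = {}
    for _ in range(n_rounds):
        for i, j in missing:
            current[i][j] = MISSING  # resample this cell given the rest of the row
            samples = [toy_sampler(current[i], j, rng) for _ in range(n_samples)]
            decisions[(i, j)] = aggregate(samples)
            current[i][j] = decisions[(i, j)]["value"]
    return current, decisions
```

Calling `selective_impute([[1.0, None, None], [2.0, 3.0, None]])` returns the refined table together with a per-cell decision dictionary. Cells whose samples mostly vote "meaningfully missing" stay empty, and the rest receive the mean of their sampled values. The `p_meaningful` field is a crude uncertainty signal; a real system would presumably use a richer one.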

