LG CC ST MLApr 28, 2025

Learning High-dimensional Gaussians from Censored Data

Arnab Bhattacharyya, Constantinos Daskalakis, Themis Gouleakis, Yuhao Wang

arXiv:2504.19446v17.11 citationsh-index: 13AISTATS

Originality Incremental advance

AI Analysis

This addresses the challenge of distribution learning with non-random missing data in high dimensions, which is incremental as it builds on prior work with specific missingness models.

The paper tackles the problem of learning high-dimensional Gaussian distributions from censored data where variables are missing not at random, providing efficient algorithms for two settings: self-censoring learns the Gaussian up to total variation distance ε with poly(d, 1/ε) samples, and linear thresholding enables efficient mean estimation under certain probability assumptions.

We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where in each sample, some of the variable values are missing. We suppose that the variables are missing not at random (MNAR). The missingness model, denoted by $S(y)$, is the function that maps any point $y$ in $R^d$ to the subsets of its coordinates that are seen. In this work, we assume that it is known. We study the following two settings: (i) Self-censoring: An observation $x$ is generated by first sampling the true value $y$ from a $d$-dimensional Gaussian $N(μ*, Σ*)$ with unknown $μ*$ and $Σ*$. For each coordinate $i$, there exists a set $S_i$ subseteq $R^d$ such that $x_i = y_i$ if and only if $y_i$ in $S_i$. Otherwise, $x_i$ is missing and takes a generic value (e.g., "?"). We design an algorithm that learns $N(μ*, Σ*)$ up to total variation (TV) distance epsilon, using $poly(d, 1/ε)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability. (ii) Linear thresholding: An observation $x$ is generated by first sampling $y$ from a $d$-dimensional Gaussian $N(μ*, Σ)$ with unknown $μ*$ and known $Σ$, and then applying the missingness model $S$ where $S(y) = {i in [d] : v_i^T y <= b_i}$ for some $v_1, ..., v_d$ in $R^d$ and $b_1, ..., b_d$ in $R$. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates and that any small subset of coordinates is observed with sufficiently high probability.

View on arXiv PDF

Similar