Spatially Robust Inference with Predicted and Missing at Random Labels

Stephen Salerno, Zhenke Wu, Tyler McCormick

arXiv:2603.11368v17.0 · h-index: 25

Predicted impact: top 29% in ML · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses inference challenges for scientists using AI predictions in spatially dependent data with missing labels, representing an incremental improvement over existing methods.

The paper tackles statistical inference with labels predicted by machine learning models when the true labels are missing at random and the observations are spatially dependent. It proposes a doubly robust estimator with a jackknife spatial HAC variance correction, which improves finite-sample calibration in simulations and on benchmark datasets.

When outcome data are expensive or onerous to collect, scientists increasingly substitute predictions from machine learning and AI models for unlabeled cases, a process which has consequences for downstream statistical inference. While recent methods provide valid uncertainty quantification under independent sampling, real-world applications involve missing at random (MAR) labeling and spatial dependence. For inference in this setting, we propose a doubly robust estimator with cross-fit nuisances. We show that cross-fitting induces fold-level correlation that distorts spatial variance estimators, producing unstable or overly conservative confidence intervals. To address this, we propose a jackknife spatial heteroscedasticity and autocorrelation consistent (HAC) variance correction that separates spatial dependence from fold-induced noise. Under standard identification and dependence conditions, the resulting intervals are asymptotically valid. Simulations and benchmark datasets show substantial improvement in finite-sample calibration, particularly under MAR labeling and clustered sampling.
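The abstract gives no formulas, so the following is only a minimal sketch of the general ingredients it names, not the authors' estimator. It assumes the target is a population mean, uses a textbook cross-fit AIPW (doubly robust) score, and pairs it with a plain Conley-style spatial HAC variance built from a product Bartlett kernel. The jackknife correction proposed in the paper is not reproduced here. All names (`crossfit_dr_scores`, `spatial_hac_variance`, the stratum variable `g`, the bandwidth value) and the simulated data are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def crossfit_dr_scores(y, f, r, g, n_folds=5, seed=0):
    """Cross-fit AIPW scores for E[Y] (illustrative assumption, not the paper's code).

    y: outcomes (only read where r == 1), f: ML predictions for every unit,
    r: 1 if the unit is labeled, g: binary stratum that drives labeling here.
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    psi = [0.0] * n
    for k in range(n_folds):
        held = set(folds[k])
        train = [i for i in idx if i not in held]
        # Outcome model: regress labeled outcomes on predictions, training folds only.
        lab = [i for i in train if r[i] == 1]
        a, b = fit_line([f[i] for i in lab], [y[i] for i in lab])
        # Labeling propensity: labeled share per stratum, clipped away from zero.
        pi = {}
        for s in (0, 1):
            members = [i for i in train if g[i] == s]
            share = sum(r[i] for i in members) / max(len(members), 1)
            pi[s] = min(max(share, 0.05), 1.0)
        # Score the held-out fold with nuisances it did not help fit.
        for i in folds[k]:
            m = a + b * f[i]
            resid = (y[i] - m) if r[i] == 1 else 0.0
            psi[i] = m + resid / pi[g[i]]
    return psi


def spatial_hac_variance(psi, coords, bandwidth):
    """Variance of mean(psi) using product Bartlett weights on 2D coordinate gaps."""
    n = len(psi)
    mean = sum(psi) / n
    e = [p - mean for p in psi]
    total = 0.0
    for i in range(n):
        for j in range(n):
            wx = max(0.0, 1.0 - abs(coords[i][0] - coords[j][0]) / bandwidth)
            wy = max(0.0, 1.0 - abs(coords[i][1] - coords[j][1]) / bandwidth)
            total += wx * wy * e[i] * e[j]
    return total / n ** 2


# Toy data on a 12 x 12 grid (simulated for illustration, not from the paper).
rng = random.Random(42)
coords = [(float(i), float(j)) for i in range(12) for j in range(12)]
g = [1 if cx >= 6 else 0 for cx, _ in coords]  # spatially clustered stratum
y = [math.sin(cx / 3.0) + math.cos(cy / 3.0) + rng.gauss(0, 0.3) for cx, cy in coords]
f = [yi + rng.gauss(0, 0.5) for yi in y]  # imperfect ML predictions
r = [1 if rng.random() < (0.8 if gi else 0.3) else 0 for gi in g]  # MAR given g

psi = crossfit_dr_scores(y, f, r, g)
estimate = sum(psi) / len(psi)
se = math.sqrt(max(spatial_hac_variance(psi, coords, bandwidth=3.0), 0.0))
print(f"estimate={estimate:.3f}, 95% CI half-width={1.96 * se:.3f}")
```

In this sketch, a bandwidth smaller than the grid spacing zeroes every off-diagonal weight, so the variance reduces to the usual independent-sampling formula. Larger bandwidths add cross-products between nearby units, which is where, according to the abstract, fold-level correlation from cross-fitting can distort the estimate.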

