ML · LG · Dec 5, 2025

Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

arXiv:2512.05456v1 · 2 citations
Originality: Synthesis-oriented
AI Analysis

The paper addresses a growing challenge for researchers facing data collection obstacles, though its contribution is incremental because it builds on classical statistical theory.

The paper tackles the problem of drawing statistical inference using predicted data as substitutes for missing observations, showing that high predictive accuracy does not ensure valid inference due to bias and variance issues.

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. Finally, we comment on some open questions and interesting avenues for future work in this area, and end with some comments on how to use predicted data in scientific studies in a way that is both transparent and statistically principled.
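The bias and variance failures described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating model, the coefficients, and the two "pre-trained" predictors (`y_hat_z`, which ignores the covariate of interest, and `y_hat_full`, which equals the conditional mean of the outcome) are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np


def ols_slope_and_se(x, y):
    """Slope and classical standard error from a simple regression of y on x."""
    x_c = x - x.mean()
    slope = (x_c @ (y - y.mean())) / (x_c @ x_c)
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    sigma2 = (resid @ resid) / (len(x) - 2)  # residual variance estimate
    se = np.sqrt(sigma2 / (x_c @ x_c))
    return slope, se


def simulate(n=20000, seed=0):
    """Compare inference on the true outcome with inference on predicted outcomes.

    Assumed (illustrative) model: y = 1.0 * x + 3.0 * z + noise, with x, z,
    and noise independent standard normals.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # covariate whose association with y we want
    z = rng.normal(size=n)  # feature available to the prediction model
    y = 1.0 * x + 3.0 * z + rng.normal(size=n)

    # Hypothetical predictor that uses z only: accurate, but blind to x.
    y_hat_z = 3.0 * z
    # Hypothetical predictor equal to the conditional mean E[y | x, z].
    y_hat_full = 1.0 * x + 3.0 * z

    r2_z = 1.0 - np.mean((y - y_hat_z) ** 2) / np.var(y)
    slope_true, se_true = ols_slope_and_se(x, y)
    slope_z, _ = ols_slope_and_se(x, y_hat_z)
    slope_full, se_full = ols_slope_and_se(x, y_hat_full)
    return {
        "r2_z": r2_z,
        "slope_true": slope_true, "se_true": se_true,
        "slope_z": slope_z,
        "slope_full": slope_full, "se_full": se_full,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, `y_hat_z` explains most of the variance in the outcome, yet regressing it on `x` gives a slope near zero instead of the true value of 1.0, which is the bias failure. Regressing `y_hat_full` on `x` recovers the slope, but its reported standard error is smaller than the one obtained from the true outcome because the prediction omits the outcome's intrinsic noise, which is the variance failure.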

