ML | LG | Jun 23, 2023

Revisiting inference after prediction

arXiv:2306.13746v2 | 11 citations | h-index: 16 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a critical issue in statistical inference for researchers who rely on machine learning predictions, but its contribution is incremental: it builds on and compares existing methods.

The paper studies how to conduct valid inference on the association between an unobserved response variable and covariates when the response is predicted by a pre-trained machine learning model. It shows that the method of Angelopoulos et al. (2023) controls the type 1 error rate and yields confidence intervals with correct nominal coverage regardless of model quality, whereas the method of Wang et al. (2020) is valid only under very strong conditions that rarely hold in practice.

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.
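
To make the contrast in steps (i) and (ii) concrete, here is a minimal sketch of a prediction-powered correction for ordinary least squares, written from the general recipe in Angelopoulos et al. (2023) as we understand it: fit OLS to the predictions on a large unlabeled sample, then subtract a "rectifier" estimated from a small labeled sample where the true response is observed. It is an illustration under stated assumptions, not the code from the paper's repository. The function names (`ols_sandwich`, `ppi_ols_ci`), the simulated data, and the deliberately poor predictor `f = 0.5 * x` are all hypothetical.

```python
import numpy as np
from statistics import NormalDist


def ols_sandwich(X, y):
    """OLS coefficients plus a heteroskedasticity-robust (sandwich) covariance.

    The returned covariance is for sqrt(n) * coef, so divide by n before use.
    """
    n = X.shape[0]
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    bread = np.linalg.inv(X.T @ X / n)
    meat = (X * (resid ** 2)[:, None]).T @ X / n
    return coef, bread @ meat @ bread


def ppi_ols_ci(X_lab, y_lab, f_lab, X_unlab, f_unlab, alpha=0.05):
    """Prediction-powered OLS point estimate and per-coefficient intervals.

    Assumes labeled and unlabeled rows are i.i.d. draws from the same
    population and that the predictor was trained on independent data.
    """
    n, N = X_lab.shape[0], X_unlab.shape[0]
    # Step 1: regress the predictions on the covariates (unlabeled sample).
    theta_f, V_f = ols_sandwich(X_unlab, f_unlab)
    # Step 2: rectifier, i.e. the regression of the prediction error f - y
    # on the covariates (labeled sample).
    delta, V_d = ols_sandwich(X_lab, f_lab - y_lab)
    theta = theta_f - delta
    se = np.sqrt(np.diag(V_f) / N + np.diag(V_d) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta, theta - z * se, theta + z * se


# Hypothetical simulation: true slope is 2, but the predictor only sees 0.5 * x.
rng = np.random.default_rng(0)


def simulate(n):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    f = 0.5 * x  # stand-in for a poor pre-trained model
    return X, y, f


X_lab, y_lab, f_lab = simulate(2000)
X_unlab, _, f_unlab = simulate(20000)

# Naive approach: treat predictions as if they were the response.
naive = np.linalg.lstsq(X_unlab, f_unlab, rcond=None)[0]
theta, lo, hi = ppi_ols_ci(X_lab, y_lab, f_lab, X_unlab, f_unlab)
print("naive slope:", naive[1])
print("corrected slope:", theta[1], "interval:", (lo[1], hi[1]))
```

In this sketch the naive regression recovers the slope of the predictor rather than the slope of the unobserved response, while the rectified estimate is centered on the labeled-data relationship. A better predictor would shrink the rectifier's variance and so narrow the interval; a worse one widens it without, under the assumptions above, invalidating it.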

Code Implementations: 1 repo
