Extending the Model-X Framework to Missing Data
This work addresses a practical limitation of statistical and machine learning variable selection methods for researchers and practitioners who work with incomplete datasets. The contribution is incremental, since it builds on the existing model-x framework.
The paper tackles the problem of controlling false selections in variable selection when the data contain missing values. It extends the model-x knockoffs framework to handle missing data while preserving its theoretical guarantees, and it verifies the findings with experiments that show how factors such as the missing data pattern and the amount of correlation affect statistical power.
One limitation of most statistical and machine learning-based variable selection approaches is their inability to control false selections. A recently introduced framework, model-x knockoffs, provides this control for a wide range of models but lacks support for datasets with missing values. In this work, we discuss ways of preserving the theoretical guarantees of the model-x framework in the missing data setting. First, we prove that posterior sampled imputation allows reusing existing knockoff samplers in the presence of missing values. Second, we show that sampling knockoffs only for the observed variables and applying univariate imputation also preserves the false selection guarantees. Third, for the special case of latent variable models, we demonstrate how jointly imputing and sampling knockoffs can reduce the computational complexity. We have verified the theoretical findings with two different explanatory variable distributions and investigated how the missing data pattern, the amount of correlation, the number of observations, and the number of missing values affect statistical power.
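The first strategy, posterior sampled imputation followed by an existing knockoff sampler, can be illustrated with a minimal sketch. It assumes the covariates are multivariate Gaussian with known mean `mu` and correlation matrix `Sigma`, so that both the conditional distribution of the missing entries and the knockoff distribution are available in closed form. The function names `impute_posterior_sample` and `sample_gaussian_knockoffs` are hypothetical, and the equicorrelated knockoff construction is a standard choice from the model-x knockoffs literature, not necessarily the sampler used by the authors.

```python
import numpy as np

def impute_posterior_sample(X, mu, Sigma, rng):
    """Fill NaNs row by row with a draw from X_mis | X_obs under N(mu, Sigma).

    Assumption: rows are i.i.d. Gaussian with known parameters.
    """
    X_imp = X.copy()
    for i in range(X.shape[0]):
        m = np.isnan(X[i])          # missing coordinates of row i
        if not m.any():
            continue
        o = ~m                      # observed coordinates of row i
        if o.any():
            # Gaussian conditioning: regress missing on observed coordinates.
            A = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
            cond_mean = mu[m] + A @ (X[i, o] - mu[o])
            cond_cov = Sigma[np.ix_(m, m)] - A @ Sigma[np.ix_(o, m)]
        else:
            # Fully missing row: sample from the marginal distribution.
            cond_mean, cond_cov = mu[m], Sigma[np.ix_(m, m)]
        cond_cov = (cond_cov + cond_cov.T) / 2   # remove round-off asymmetry
        X_imp[i, m] = rng.multivariate_normal(cond_mean, cond_cov)
    return X_imp

def sample_gaussian_knockoffs(X, mu, Sigma, rng):
    """Equicorrelated Gaussian knockoffs for complete data.

    Assumption: Sigma has unit diagonal (a correlation matrix).
    """
    p = Sigma.shape[0]
    # s must satisfy 2*Sigma - s*I >= 0; shrink slightly for strict positivity.
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]) * 0.999
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    cond_mean = X - (X - mu) @ Sigma_inv_D
    cond_cov = 2.0 * D - D @ Sigma_inv_D
    cond_cov = (cond_cov + cond_cov.T) / 2
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal(X.shape) @ L.T

# Small illustration on synthetic data (all values here are made up).
rng = np.random.default_rng(0)
p = 5
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
mu = np.zeros(p)
X = rng.multivariate_normal(mu, Sigma, size=500)
X[rng.random(X.shape) < 0.2] = np.nan                # entries missing at random
X_imputed = impute_posterior_sample(X, mu, Sigma, rng)
X_knockoff = sample_gaussian_knockoffs(X_imputed, mu, Sigma, rng)
```

The imputed matrix is passed to the knockoff sampler unchanged, which is the point of this strategy: because the imputed rows are drawn from the conditional distribution under the assumed model, they follow the same joint distribution as complete rows, so a sampler designed for complete data can be reused as is.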