Dealing with missing data using attention and latent space regularization
This work addresses missing data, a common problem in data science, and its approach could benefit a range of applications, though it appears incremental because it builds on existing regularization and attention techniques.
The paper tackles missing data by developing a theoretical framework that avoids imputation, using attention and latent space regularization to reduce the bias that missingness can introduce. It reports improved performance over a state-of-the-art model and industry-standard imputation on 11 benchmark datasets with missingness and 18 corrupted datasets.
Most practical data science problems encounter missing data. A wide variety of solutions exist, each with strengths and weaknesses that depend upon the missingness-generating process. Here we develop a theoretical framework for training and inference using only observed variables, enabling modeling of incomplete datasets without imputation. Using an information- and measure-theoretic argument, we construct models with latent space representations that regularize against the potential bias introduced by missing data. The theoretical properties of this approach are demonstrated empirically using a synthetic dataset. The performance of this approach is tested on 11 benchmarking datasets with missingness and on 18 datasets corrupted across three missingness patterns, with comparison against a state-of-the-art model and industry-standard imputation. We show that our proposed method overcomes the weaknesses of imputation methods and outperforms the current state-of-the-art.
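To make the idea of "using only observed variables" concrete, the following is a minimal sketch of attention pooling restricted to the observed entries of a single row. It is not the paper's architecture: the function name `masked_attention_pool`, the per-feature embedding parameters `embed_w` and `embed_b`, and the `query` vector are all hypothetical, and the latent space regularization described in the abstract is not shown. The sketch only illustrates how a fixed-size latent vector can be produced from a variable set of observed features without filling in the missing ones.

```python
import math


def masked_attention_pool(x, embed_w, embed_b, query):
    """Pool the observed features of one row into a fixed-size latent vector.

    x       : list of d floats; NaN or None marks a missing entry
    embed_w : d x k per-feature embedding weights (hypothetical parameters)
    embed_b : d x k per-feature embedding biases (hypothetical parameters)
    query   : k-vector used to score each feature token (hypothetical)
    """
    k = len(query)
    # Indices of observed features; missing entries are skipped, never imputed.
    observed = [j for j, v in enumerate(x) if v is not None and not math.isnan(v)]
    if not observed:
        # Assumption: a fully missing row maps to the zero vector.
        return [0.0] * k

    # One k-dimensional token per observed feature.
    tokens = {
        j: [x[j] * embed_w[j][c] + embed_b[j][c] for c in range(k)]
        for j in observed
    }
    # Scaled dot-product score of each token against the query.
    scores = {
        j: sum(t * q for t, q in zip(tokens[j], query)) / math.sqrt(k)
        for j in observed
    }
    # Softmax over observed features only (shifted by the max for stability).
    top = max(scores.values())
    exp_scores = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exp_scores.values())

    # Attention-weighted sum of the observed tokens.
    return [
        sum(exp_scores[j] / total * tokens[j][c] for j in observed)
        for c in range(k)
    ]
```

Because the softmax runs only over observed indices, the result for a row with missing entries is the same as the result for that row with the missing columns deleted, so no placeholder value can leak into the latent representation.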