CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation
This work is aimed at data scientists and analysts who work with incomplete tabular datasets. It improves imputation quality by exploiting observed missingness patterns and contextual cues such as column names and text descriptions.
The paper addresses missing-value imputation in tabular data by introducing CACTI, a masked autoencoding approach that uses copy masking and contextual information, achieving an average $R^2$ gain of 7.8% over the next best method across various missingness conditions.
We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
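
The abstract does not spell out how median truncated copy masking is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "copy masking" means hiding observed cells according to a missingness pattern borrowed from another row of the same dataset, and that "median truncated" means capping the number of cells left visible in each row at the batch median. The function names `copy_mask` and `median_truncate`, the uniform donor sampling, and the fallback for rows that would otherwise be fully hidden are all hypothetical choices made for illustration.

```python
import numpy as np


def copy_mask(X, rng):
    """Build a training mask by borrowing missingness patterns from other rows.

    X is a 2-D float array with np.nan marking missing entries.
    The result is True where an observed value is hidden during training.
    """
    observed = ~np.isnan(X)
    # Assumption: donor rows are chosen by a uniform random permutation.
    donors = rng.permutation(X.shape[0])
    donor_missing = ~observed[donors]
    # Hide a cell only if it is observed here and missing in the donor row.
    train_mask = observed & donor_missing
    # Assumption: if a row would have no visible cells left, skip masking it.
    visible = observed & ~train_mask
    train_mask[visible.sum(axis=1) == 0] = False
    return train_mask


def median_truncate(visible, rng):
    """Cap the number of visible cells per row at the batch median.

    Assumption: surplus visible cells are dropped uniformly at random.
    """
    counts = visible.sum(axis=1)
    cap = int(np.median(counts))
    out = visible.copy()
    for i in np.flatnonzero(counts > cap):
        cols = np.flatnonzero(visible[i])
        drop = rng.choice(cols, size=counts[i] - cap, replace=False)
        out[i, drop] = False
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 5))
    X[rng.random(X.shape) < 0.3] = np.nan  # toy data with missing entries
    hidden = copy_mask(X, rng)
    visible = median_truncate(~np.isnan(X) & ~hidden, rng)
    print(hidden.astype(int))
    print(visible.astype(int))
```

Under these assumptions, the model would see only the `visible` cells and be trained to reconstruct the cells marked in `hidden`, so the training masks mirror the missingness patterns that actually occur in the data rather than uniformly random ones.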