LG | AI | Oct 31, 2022

Diffusion models for missing value imputation in tabular data

arXiv:2210.17128v2 | 118 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a practical data preprocessing problem for machine learning practitioners, but it is incremental as it adapts diffusion models from other domains to tabular data.

The paper tackles missing value imputation in tabular data by proposing TabCSDI, a diffusion model approach that effectively handles both categorical and numerical variables, achieving improved performance over existing methods on benchmark datasets.

Missing value imputation in machine learning is the task of accurately estimating the missing values in a dataset using the available information. Several deep generative modeling methods, e.g., generative adversarial imputation networks, have been proposed for this task and have demonstrated their usefulness. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling of images, text, audio, etc. To our knowledge, little attention has been paid to the effectiveness of diffusion models for missing value imputation in tabular data. Building on the recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (TabCSDI). To handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of TabCSDI compared with well-known existing methods and emphasize the importance of the categorical embedding techniques.
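Two of the three categorical handling techniques named in the abstract, one-hot encoding and analog bits encoding, can be illustrated with a minimal NumPy sketch. This is not the TabCSDI implementation: the function names, the choice of {-1, +1} as the analog bit values, and decoding by thresholding at 0 are assumptions made for illustration. Feature tokenization is omitted because it relies on learned embeddings.

```python
import numpy as np

# Hypothetical helper names; not taken from the TabCSDI code.

def one_hot_encode(codes, num_categories):
    """Map integer category codes to one-hot rows of length num_categories."""
    codes = np.asarray(codes)
    return np.eye(num_categories, dtype=np.float32)[codes]

def analog_bits_encode(codes, num_categories):
    """Write each code in binary, then rescale bits {0, 1} to {-1.0, +1.0}."""
    codes = np.asarray(codes)
    # Number of bits needed to represent codes 0 .. num_categories - 1.
    num_bits = max(1, (num_categories - 1).bit_length())
    shifts = np.arange(num_bits - 1, -1, -1)   # most significant bit first
    bits = (codes[:, None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0

def analog_bits_decode(values):
    """Threshold real-valued outputs at 0 and read the bits back as integers."""
    bits = (np.asarray(values) > 0).astype(np.int64)
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights

codes = [0, 3, 4]                                # a column with 5 categories
print(one_hot_encode(codes, 5).shape)            # 5 columns per value
print(analog_bits_encode(codes, 5))              # 3 columns per value
print(analog_bits_decode(analog_bits_encode(codes, 5)))
```

In this sketch a column with K categories occupies K real-valued dimensions under one-hot encoding but only about log2(K) under analog bits, which is one reason the encoding choice can matter when a continuous diffusion model is applied to categorical features.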

Code Implementations: 1 repo