LG ML · Dec 16, 2023

Continuous Diffusion for Mixed-Type Tabular Data

Markus Mueller, Kathrin Gruber, Dennis Fok

arXiv:2312.10431v5 · 15 citations · h-index: 2 · Has Code · ICLR

Originality: Incremental advance

AI Analysis

This work addresses a gap in generative modeling for mixed-type tabular data, which is common in real-world applications such as healthcare and finance, though it appears incremental as an adaptation of existing diffusion methods.

The authors adapt diffusion models to mixed-type tabular data, proposing CDTD, which uses a unified continuous noise distribution and adaptive noise schedules. In their experiments, CDTD outperformed state-of-the-art benchmark models and captured feature correlations well.

Score-based generative models, commonly referred to as diffusion models, have proven to be successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. CDTD is based on a novel combination of score matching and score interpolation to enforce a unified continuous noise distribution for both continuous and categorical features. We explicitly acknowledge the necessity of homogenizing distinct data types by relying on model-specific loss calibration and initialization schemes. To further address the high heterogeneity in mixed-type tabular data, we introduce adaptive feature- or type-specific noise schedules. These ensure balanced generative performance across features and optimize the allocation of model capacity across features and diffusion time. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality. Replication code is available at https://github.com/muellermarkus/cdtd.
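The idea of a unified continuous noise distribution with score interpolation can be illustrated with a small sketch. This is not the authors' implementation (see the replication code for that); it assumes that each category is mapped to an embedding vector, that Gaussian noise of the same scale `sigma` is added to continuous values and to embeddings, and that the score for a categorical feature is obtained from predicted class probabilities as the difference between the expected clean embedding and the noisy one. The function names, the embedding table, and the hand-written probabilities are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise(x_cont, cat_idx, embeddings, sigma):
    """Perturb continuous values and category embeddings with one noise level.

    x_cont:     (n, d_cont) continuous features
    cat_idx:    (n,) integer codes of a single categorical feature
    embeddings: (n_classes, d_emb) hypothetical embedding table
    """
    x_cat = embeddings[cat_idx]  # look up the clean embedding per row
    noisy_cont = x_cont + sigma * rng.standard_normal(x_cont.shape)
    noisy_cat = x_cat + sigma * rng.standard_normal(x_cat.shape)
    return noisy_cont, noisy_cat


def interpolated_score(noisy_cat, class_probs, embeddings, sigma):
    """Score estimate for a categorical feature via score interpolation.

    class_probs stands in for a classifier's predicted probabilities over
    categories; the expected clean embedding is their weighted average.
    """
    x0_hat = class_probs @ embeddings  # (n, d_emb) expected clean embedding
    return (x0_hat - noisy_cat) / sigma**2


# Toy data: one continuous feature, one categorical feature with 3 classes.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
x_cont = np.array([[0.5], [-1.2]])
cat_idx = np.array([0, 2])

noisy_cont, noisy_cat = add_noise(x_cont, cat_idx, embeddings, sigma=0.5)

# Hand-written probabilities in place of a trained model (assumption).
probs = np.array([[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
score = interpolated_score(noisy_cat, probs, embeddings, sigma=0.5)
```

In this sketch a trained model would supply `class_probs` through a cross-entropy-trained classifier head, while continuous features would use an ordinary score-matching loss; the feature- or type-specific noise schedules described in the abstract would replace the single shared `sigma`.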
