LG · Apr 7, 2025

TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li

arXiv:2504.04798v4 · 7 citations · h-index: 9 · Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating high-quality synthetic tabular data for applications that require privacy and efficiency. It is an incremental improvement over existing tabular diffusion methods.

The paper introduces TabRep, a tabular diffusion architecture trained with a unified continuous representation. The representation targets two problems: the difficulty of jointly modeling multi-modal feature distributions in one model, and the sparse, suboptimal encodings used by earlier unified approaches. The authors report that TabRep's synthetic data exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.

Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse suboptimal encoding heuristics and necessitates additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are composed of its density, flexibility to provide ample separability for nominal features, and ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.
