LG AI Sep 20, 2024

Tabular Data Generation using Binary Diffusion

arXiv:2409.13882v24.64 citations h-index: 5 Has Code

Originality: Highly original

AI Analysis

This work addresses the challenge that machine learning practitioners face when real tabular data is limited or sensitive, offering a more efficient approach that avoids extensive preprocessing and large pretrained models.

The paper tackles the problem of generating synthetic tabular data by introducing a novel binary transformation method and Binary Diffusion model, which outperforms state-of-the-art models on benchmarks like Travel, Adult Income, and Diabetes datasets while being smaller in size.

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size. Code and models are available at: https://github.com/vkinakh/binary-diffusion-tabular
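The abstract's two core ideas, a lossless fixed-size binary encoding and XOR-based noising with a binary cross-entropy loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The encoding scheme shown here (fixed-width binary codes for categorical columns, raw 32-bit patterns for float columns), the Bernoulli flip mask, and all function names are assumptions chosen for illustration. The denoising network is omitted entirely.

```python
import numpy as np

def encode_categorical(codes, n_categories):
    """Encode integer category codes as fixed-width bit rows (assumed scheme)."""
    codes = np.asarray(codes, dtype=np.int64)
    # Number of bits needed to represent n_categories distinct values.
    width = max(1, (int(n_categories) - 1).bit_length())
    shifts = np.arange(width - 1, -1, -1)
    return ((codes[:, None] >> shifts) & 1).astype(np.uint8)

def decode_categorical(bits):
    """Invert encode_categorical: bit rows back to integer codes."""
    width = bits.shape[1]
    weights = 1 << np.arange(width - 1, -1, -1)
    return (bits.astype(np.int64) * weights).sum(axis=1)

def encode_float32(values):
    """Encode floats losslessly as their 32-bit patterns (assumed scheme)."""
    raw = np.ascontiguousarray(values, dtype=np.float32).view(np.uint8)
    return np.unpackbits(raw.reshape(-1, 4), axis=1)  # shape (n, 32)

def decode_float32(bits):
    """Invert encode_float32: 32-bit rows back to float32 values."""
    return np.packbits(bits, axis=1).view(np.float32).reshape(-1)

def add_xor_noise(x0, flip_prob, rng):
    """Noise addition: XOR the clean bits with a random Bernoulli mask."""
    z = (rng.random(x0.shape) < flip_prob).astype(np.uint8)
    return x0 ^ z, z

def remove_xor_noise(xt, z):
    """Noise removal: XOR is its own inverse, so the same mask undoes it."""
    return xt ^ z

def binary_cross_entropy(pred_prob, target):
    """Mean BCE between predicted bit probabilities and target bits."""
    p = np.clip(pred_prob, 1e-7, 1.0 - 1e-7)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())
```

In a training loop built on this sketch, a model would receive the noised bits from `add_xor_noise` and output per-bit probabilities, which `binary_cross_entropy` would score against the clean bits or the mask. Because XOR is self-inverse, a correctly predicted mask recovers the original row exactly, and the decoders then map the bits back to table values without loss.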
