LG · AI · Sep 20, 2024

Tabular Data Generation using Binary Diffusion

arXiv:2409.13882v2 · 4 citations · h-index: 5 · Has Code
Originality: Highly original
AI Analysis

This work addresses a common problem for machine learning practitioners: real tabular data is often limited or too sensitive to share. It offers a more efficient way to generate synthetic data, without extensive preprocessing or large models.

The paper tackles synthetic tabular data generation by introducing a novel binary transformation method and a Binary Diffusion model. The model outperforms state-of-the-art models on the Travel, Adult Income, and Diabetes benchmark datasets while being smaller in size.

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size. Code and models are available at: https://github.com/vkinakh/binary-diffusion-tabular
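The two mechanisms named in the abstract, a lossless fixed-size binary encoding and XOR-based noising, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names, the choice of raw IEEE-754 float32 bits for numeric columns, the fixed-width binary codes for categorical columns, and the Bernoulli flip probability are all assumptions made for illustration. The abstract does not specify these details, and the linked repository is the authoritative source.

```python
import numpy as np

# Hypothetical helpers: an illustrative encoding, not the paper's exact scheme.

def encode_float32(values):
    """Map each value to the 32 raw bits of its float32 representation."""
    raw = np.ascontiguousarray(values, dtype=">f4")   # big-endian float32
    return np.unpackbits(raw.view(np.uint8).reshape(-1, 4), axis=1)

def decode_float32(bits):
    """Inverse of encode_float32: 32 bits per row back to float32."""
    return np.packbits(bits, axis=1).view(">f4").reshape(-1)

def encode_categorical(codes, n_categories):
    """Map integer category codes to fixed-width binary, MSB first."""
    width = max(1, (n_categories - 1).bit_length())
    shifts = np.arange(width - 1, -1, -1)
    codes = np.asarray(codes, dtype=np.int64)
    return ((codes[:, None] >> shifts) & 1).astype(np.uint8)

def decode_categorical(bits):
    """Inverse of encode_categorical."""
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return bits.astype(np.int64) @ weights

def add_xor_noise(x0, flip_prob, rng):
    """Flip each bit independently with probability flip_prob via XOR."""
    z = (rng.random(x0.shape) < flip_prob).astype(np.uint8)
    return x0 ^ z, z

# Toy table: one numeric column and one categorical column with 3 categories.
numeric = np.array([0.25, -1.5, 3.0], dtype=np.float32)
category = np.array([0, 2, 1])

# Each row becomes a fixed-size binary vector: 32 bits + 2 bits.
rows = np.concatenate(
    [encode_float32(numeric), encode_categorical(category, 3)], axis=1
)

# The transformation is lossless: decoding recovers the original columns.
numeric_back = decode_float32(rows[:, :32])
category_back = decode_categorical(rows[:, 32:])

# XOR noising is its own inverse: applying the same mask again undoes it.
rng = np.random.default_rng(0)
noisy, mask = add_xor_noise(rows, flip_prob=0.3, rng=rng)
recovered = noisy ^ mask
```

Because XOR is self-inverse, a network that predicts the noise mask (or the clean bits) from `noisy` yields a per-bit binary target, which fits the binary cross-entropy loss the abstract mentions. The training loop and sampling schedule are omitted here because the abstract does not describe them.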

Code Implementations: 1 repo