Towards Synthesizing High-Dimensional Tabular Data with Limited Samples
CtrTab addresses a critical bottleneck for data scientists and ML practitioners who need synthetic tabular data for privacy or augmentation when the data are high-dimensional and training samples are scarce, making it a strong domain-specific advance.
The paper tackles tabular data synthesis in the high-dimensional, limited-sample regime, where existing diffusion models degenerate. It proposes CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training to improve robustness. Experimental results show that CtrTab outperforms state-of-the-art models, with a reported accuracy gap of over 90% on average.
Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit L2 regularization on the model's sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with a performance gap in accuracy over 90% on average.
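The link between perturbing the control signal and an implicit L2 penalty can be illustrated with a toy case. The sketch below is not CtrTab itself: it assumes a linear predictor acting on the control signal, Gaussian perturbations, and a squared-error loss, none of which are specified in the abstract, and the names `make_control`, `expected_loss_monte_carlo`, and `expected_loss_closed_form` are hypothetical. Under those assumptions, the expected loss over perturbations equals the clean loss plus `noise_scale**2 * ||w||^2`, which penalizes the predictor's sensitivity to the control input. The code checks this identity numerically.

```python
import numpy as np


def make_control(x0, noise_scale, rng):
    """Perturb ground-truth values to form a control signal.

    Gaussian noise is an assumption of this sketch, not taken from the paper.
    """
    return x0 + noise_scale * rng.standard_normal(x0.shape)


def expected_loss_monte_carlo(w, x0, target, noise_scale, rng, n_draws=100_000):
    """Estimate E[(w . c - target)^2] over random perturbations of the control c."""
    # Repeat the clean row n_draws times, then perturb each copy independently.
    controls = make_control(np.tile(x0, (n_draws, 1)), noise_scale, rng)
    residuals = controls @ w - target
    return float(np.mean(residuals ** 2))


def expected_loss_closed_form(w, x0, target, noise_scale):
    """Clean squared error plus an L2 penalty on the control weights."""
    clean_error = (x0 @ w - target) ** 2
    l2_penalty = noise_scale ** 2 * np.sum(w ** 2)
    return float(clean_error + l2_penalty)


# Toy setup: one 20-dimensional row and a fixed linear predictor (illustrative values).
rng = np.random.default_rng(0)
dim = 20
w = rng.standard_normal(dim)
x0 = rng.standard_normal(dim)
target = 0.0
noise_scale = 0.5

estimate = expected_loss_monte_carlo(w, x0, target, noise_scale, rng)
closed_form = expected_loss_closed_form(w, x0, target, noise_scale)
print(f"Monte Carlo estimate: {estimate:.3f}")
print(f"Closed form:          {closed_form:.3f}")
```

The two printed values should agree up to Monte Carlo error. For a nonlinear denoiser the identity holds only approximately (to second order in the noise scale), with the gradient of the model with respect to the control signal playing the role of `w`.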