Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
For practitioners generating synthetic tabular data from small datasets, DiffICL offers a way to obtain high-quality synthetic data without sacrificing privacy.
Existing tabular generative models suffer from a quality-privacy tradeoff in the small-data regime, where improving data quality increases memorization of training samples. DiffICL uses in-context learning with pretrained structural priors to break this tradeoff, achieving both higher data quality and better privacy on 14 real-world datasets.
Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, which weakens privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be mitigated through better training paradigms.
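To make the notion of memorization concrete, the sketch below shows one common way to probe it: compare each synthetic row's distance to its closest training record against its distance to the closest held-out record. This is an illustrative assumption, not necessarily the privacy metric used to evaluate DiffICL. The function names (`distance_to_closest_record`, `memorization_rate`) and the toy Gaussian data are hypothetical.

```python
import numpy as np


def distance_to_closest_record(synthetic, reference):
    """Euclidean distance from each synthetic row to its nearest reference row."""
    diffs = synthetic[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)


def memorization_rate(synthetic, train, holdout):
    """Share of synthetic rows that lie closer to training data than to held-out data.

    A value near 0.5 suggests no preference for training rows; a value near 1.0
    suggests the generator reproduces training samples (assumed proxy, see lead-in).
    """
    d_train = distance_to_closest_record(synthetic, train)
    d_holdout = distance_to_closest_record(synthetic, holdout)
    return float(np.mean(d_train < d_holdout))


# Toy data: numeric features only, drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
holdout = rng.normal(size=(50, 4))

# A "memorizing" generator: near copies of the training rows.
copied = train + rng.normal(scale=1e-3, size=train.shape)
# A "generalizing" generator: fresh draws from the underlying distribution.
fresh = rng.normal(size=(50, 4))

copied_rate = memorization_rate(copied, train, holdout)
fresh_rate = memorization_rate(fresh, train, holdout)
print(f"near-copies: {copied_rate:.2f}, fresh samples: {fresh_rate:.2f}")
```

In this toy setting the near-copy generator would score perfectly on most fidelity measures while sitting almost on top of the training rows, which is the tradeoff described above. Real tabular data would also need handling for categorical columns and feature scaling before distances are meaningful.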