LG MLJul 22, 2025

Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation

Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng

arXiv:2507.17066v17.11 citations

Originality Incremental advance

AI Analysis

This addresses privacy risks for data practitioners using synthetic data in sensitive domains like health and finance, though it is incremental as it benchmarks existing models.

The paper tackled the problem of privacy leakage in synthetic tabular data generation using foundation models in low-data settings, finding that models like LLaMA 3.3 70B have up to 54 percentage points higher true-positive rate at 1% FPR than safer baselines, and identified prompt tweaks that reduce leakage by up to 39 points while maintaining fidelity.

Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis-where a single row may identify a person-remains unclear. We address this gap with the first benchmark of three foundation models (GPT-4o-mini, LLaMA 3.3 70B, TabPFN v2) against four baselines on 35 real-world tables from health, finance, and policy. We evaluate statistical fidelity, downstream utility, and membership inference leakage. Results show foundation models consistently have the highest privacy risk. LLaMA 3.3 70B reaches up to 54 percentage points higher true-positive rate at 1% FPR than the safest baseline. GPT-4o-mini and TabPFN are also highly vulnerable. We plot the privacy-utility frontier and show that CTGAN and GPT-4o-mini offer better tradeoffs. A factorial study finds that three zero-cost prompt tweaks-small batch size, low temperature, and using summary statistics-can reduce worst-case AUC by 14 points and rare-class leakage by up to 39 points while maintaining over 90% fidelity. Our benchmark offers a practical guide for safer low-data synthesis with foundation models.

View on arXiv PDF

Similar