LG · CL · CR · Dec 3, 2024

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

arXiv:2412.02467v2 · 5 citations · h-index: 8 · Has Code · Trans. Mach. Learn. Res.
AI Analysis

This paper addresses the problem of privacy-preserving data generation for machine learning practitioners, offering an incremental improvement in DP tabular data synthesis.

The paper tackles the challenge of generating synthetic tabular data under differential privacy (DP) constraints by proposing DP-2Stage, a two-stage fine-tuning framework for language models, which improves performance over direct DP fine-tuning across various settings and metrics.

Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
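
The two-stage schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it swaps the language model for a small logistic-regression model in NumPy, uses full-batch updates, and performs no privacy accounting. The function names (`sgd_step`, `dp_sgd_step`, `two_stage_fit`) and all hyperparameter values are hypothetical. Stage 1 takes ordinary gradient steps on a pseudo dataset; stage 2 continues from those weights with DP-SGD-style updates (per-example gradient clipping plus Gaussian noise) on the private dataset.

```python
import numpy as np


def per_example_grads(w, X, y):
    """Per-example gradients of the logistic loss, shape (n_examples, n_features)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X


def sgd_step(w, X, y, lr):
    """Stage 1 update: ordinary, non-private gradient step."""
    return w - lr * per_example_grads(w, X, y).mean(axis=0)


def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """Stage 2 update: clip each example's gradient, sum, add Gaussian noise, average."""
    g = per_example_grads(w, X, y)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g * np.minimum(1.0, clip_norm / (norms + 1e-12))  # per-example clipping
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * (g.sum(axis=0) + noise) / len(X)


def two_stage_fit(pseudo, private, steps=(200, 200), lr=0.5,
                  clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """Toy two-stage schedule (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(seed)
    X_pseudo, y_pseudo = pseudo
    X_priv, y_priv = private
    w = np.zeros(X_pseudo.shape[1])
    # Stage 1: non-private fine-tuning on the pseudo dataset.
    for _ in range(steps[0]):
        w = sgd_step(w, X_pseudo, y_pseudo, lr)
    # Stage 2: noisy, clipped updates on the private dataset.
    for _ in range(steps[1]):
        w = dp_sgd_step(w, X_priv, y_priv, lr, clip_norm, noise_multiplier, rng)
    return w
```

In this sketch only stage 2 touches the private data, so only those steps would count against a privacy budget. A real pipeline would also need subsampled batches and a privacy accountant to report a concrete guarantee.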

Code Implementations: 1 repo
