Systematic Assessment of Tabular Data Synthesis
This work addresses how researchers and practitioners can assess privacy-preserving data synthesis methods. Its contribution is incremental, since it builds on existing evaluation approaches.
The paper tackles the lack of comprehensive evaluation for tabular data synthesis algorithms by developing a systematic framework with new metrics for fidelity, privacy, and utility, and it reports findings from evaluating 8 synthesizers on 12 real-world datasets.
Data synthesis has been advocated as an important approach for utilizing data while protecting data privacy. In recent years, a plethora of tabular data synthesis algorithms (i.e., synthesizers) have been proposed. Some synthesizers satisfy Differential Privacy, while others aim to provide privacy in a heuristic fashion. A comprehensive understanding of the strengths and weaknesses of these synthesizers remains elusive for two reasons: existing evaluation metrics have drawbacks, and newly developed synthesizers that take advantage of diffusion models and large language models have not been compared head-to-head with state-of-the-art statistical synthesizers. In this paper, we present a systematic evaluation framework for assessing tabular data synthesis algorithms. Specifically, we examine and critique existing evaluation metrics, and introduce a set of new metrics in terms of fidelity, privacy, and utility to address their limitations. We conduct extensive evaluations of 8 different types of synthesizers on 12 real-world datasets and identify several findings that suggest new directions for privacy-preserving data synthesis.