LG · Oct 25, 2023

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

arXiv:2310.16981v1 · 36 citations · h-index: 74
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of improving synthetic data quality for machine learning practitioners when real data is limited. Its contribution is incremental: it offers a benchmark and recommendations rather than a new generation method.

This paper tackles the challenge of generating synthetic tabular data that accurately reflects real-world complexities by integrating data-centric AI techniques to guide the process, and it benchmarks five state-of-the-art models on eleven datasets to provide insights and recommendations.

Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation -- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
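To make the idea of a "data profile" concrete, here is a minimal sketch, not the paper's framework. The abstract does not name a profiling method, so this example substitutes a simple leave-one-out nearest-neighbour label-agreement score to split samples into "easy" and "hard", then compares the share of each profile in a real dataset and a synthetic one. All function names, the toy data, the threshold, and the "generator" (a noise-free resample standing in for a model that fails to reproduce hard cases) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def knn_label_agreement(X, y, k=5):
    """For each sample, the fraction of its k nearest neighbours
    (excluding itself) that share its label."""
    scores = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(y[j] == y[i] for j in neighbours) / k)
    return scores


def profile(X, y, k=5, threshold=0.5):
    """Assign a profile to each sample: 'easy' if most neighbours agree
    with its label, otherwise 'hard'. The threshold is an assumption."""
    return [
        "easy" if score > threshold else "hard"
        for score in knn_label_agreement(X, y, k)
    ]


def profile_shares(profiles):
    """Fraction of samples falling into each profile."""
    counts = Counter(profiles)
    return {name: counts.get(name, 0) / len(profiles) for name in ("easy", "hard")}


def make_toy_data(n, noise, seed):
    """Two Gaussian clusters; `noise` is the probability of flipping a
    label, which creates samples that look 'hard' to the profiler."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label else -2.0
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(1 - label if rng.random() < noise else label)
    return X, y


# "Real" data contains some hard (label-flipped) samples.
real_X, real_y = make_toy_data(200, noise=0.15, seed=0)
# Stand-in for a generator that matches the feature distribution
# but does not reproduce the hard samples.
synth_X, synth_y = make_toy_data(200, noise=0.0, seed=1)

real_shares = profile_shares(profile(real_X, real_y))
synth_shares = profile_shares(profile(synth_X, synth_y))
print("real:     ", real_shares)
print("synthetic:", synth_shares)
```

In this toy setup both datasets have the same feature distribution, yet the profile shares differ. That is the kind of gap that marginal statistical-fidelity checks can miss and that a profile-aware evaluation is meant to surface.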

Code Implementations: 2 repos
