LG AI, Jul 21, 2025

FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

arXiv:2507.15839v1
Originality: Incremental advance
AI Analysis

The paper addresses the need of researchers and practitioners for scalable, cost-effective synthetic data generation. The contribution is incremental because it builds on existing LLM-based approaches.

The paper tackles the high time and cost of generating synthetic tabular data record by record with LLMs. Its method has the LLM encode each field's distribution into a reusable sampling script, so that large datasets can be produced without further model inference. The authors report that this outperforms direct per-record generation in both diversity and realism.

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
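The idea of a reusable sampling script can be illustrated with a minimal sketch. The paper's actual script format is not shown in this summary, so everything below is an assumption made for illustration: the `FIELD_SPECS` layout, the field names, the distribution parameters, and the template-based handling of free-text fields are all hypothetical. The sketch only shows the general pattern, in which per-field distributions are written down once (as an LLM might emit them) and then sampled locally with no further model calls.

```python
import random

# Hypothetical per-field specs of the kind an LLM might emit once per schema.
# All names and parameter values here are illustrative assumptions.
FIELD_SPECS = {
    "age": {"type": "numerical", "mean": 38.0, "std": 12.0, "min": 18, "max": 90},
    "plan": {
        "type": "categorical",
        "values": ["free", "pro", "enterprise"],
        "weights": [0.6, 0.3, 0.1],
    },
    "note": {
        "type": "free_text",
        # Assumption: free text is approximated by templates with slot fillers.
        "templates": ["Customer asked about {topic}.", "Follow-up on {topic} needed."],
        "slots": {"topic": ["billing", "login", "export"]},
    },
}


def sample_field(spec, rng):
    """Draw one value for a field according to its type and distribution."""
    kind = spec["type"]
    if kind == "numerical":
        # Sample from a normal distribution, then clamp to the allowed range.
        value = rng.gauss(spec["mean"], spec["std"])
        return round(min(max(value, spec["min"]), spec["max"]))
    if kind == "categorical":
        return rng.choices(spec["values"], weights=spec["weights"], k=1)[0]
    if kind == "free_text":
        template = rng.choice(spec["templates"])
        fills = {name: rng.choice(options) for name, options in spec["slots"].items()}
        return template.format(**fills)
    raise ValueError(f"unknown field type: {kind}")


def generate_rows(specs, n, seed=None):
    """Generate n synthetic records locally, without any model inference."""
    rng = random.Random(seed)
    return [
        {name: sample_field(spec, rng) for name, spec in specs.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    for row in generate_rows(FIELD_SPECS, 3, seed=0):
        print(row)
```

Under this pattern the model cost is paid once per schema rather than once per record, which is the saving the abstract describes; generating more rows only changes the argument `n`.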

