LG | AI | Jul 25, 2025

Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci

arXiv:2507.19334v1 | 16.96 citations | h-index: 36 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of tabular data for domains like healthcare or finance, offering a fast and accurate augmentation method, though it is incremental as it builds on existing generative approaches.

The paper addresses efficient generation of high-quality tabular data by targeting two limitations of existing LLM-based methods: bias introduced by dense dependency modeling and high computational cost in sampling. It reports a 4% reduction in constraint violations compared to diffusion-based methods and a speedup of nearly 9,500x in generation over LLM-based baselines.

Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.
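
The graph-traversal idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_rows`, the hand-written `parents` dictionary (a stand-in for the LLM-induced graph), the fixed `bandwidth`, and the restriction to numeric features are all assumptions made for illustration. The sketch visits features in topological order and draws each value from a Gaussian-kernel conditional density given only the already-sampled parent values.

```python
from graphlib import TopologicalSorter

import numpy as np


def sample_rows(data, parents, n_samples, bandwidth=0.3, seed=0):
    """Sample synthetic rows by walking a feature dependency graph.

    data:    dict mapping feature name -> 1-D float array (training column)
    parents: dict mapping feature name -> list of parent feature names
             (hypothetical stand-in for an LLM-induced sparse graph)
    """
    rng = np.random.default_rng(seed)
    # Parents always come before their children in this order.
    order = list(TopologicalSorter(parents).static_order())
    n_train = len(data[order[0]])

    # Standardize columns so a single bandwidth is meaningful for all features.
    stats = {f: (data[f].mean(), data[f].std() + 1e-12) for f in order}
    z = {f: (data[f] - stats[f][0]) / stats[f][1] for f in order}

    out = {}
    for f in order:
        pa = parents.get(f, [])
        if pa:
            train_pa = np.stack([z[p] for p in pa], axis=1)  # (n_train, k)
            new_pa = np.stack([out[p] for p in pa], axis=1)  # (n_samples, k)
            # Gaussian kernel weight of each training row given sampled parents.
            d2 = ((new_pa[:, None, :] - train_pa[None, :, :]) ** 2).sum(axis=-1)
            logw = -0.5 * d2 / bandwidth**2
            w = np.exp(logw - logw.max(axis=1, keepdims=True))
            w /= w.sum(axis=1, keepdims=True)
            idx = np.array([rng.choice(n_train, p=row) for row in w])
        else:
            # Root feature: plain (unconditional) kernel density sampling.
            idx = rng.integers(0, n_train, size=n_samples)
        # Pick a kernel center, then add Gaussian kernel noise.
        out[f] = z[f][idx] + bandwidth * rng.standard_normal(n_samples)

    # Map back to the original scale of each feature.
    return {f: out[f] * stats[f][1] + stats[f][0] for f in order}


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=300)
    toy = {"x": x, "y": 2.0 * x + 0.1 * rng.normal(size=300)}
    synthetic = sample_rows(toy, {"x": [], "y": ["x"]}, n_samples=200)
    print(np.corrcoef(synthetic["x"], synthetic["y"])[0, 1])
```

Because each feature is conditioned only on its parents, the cost of a sampling step depends on the number of parents rather than on the total number of columns, which is the intuition behind the sparse-dependency design described above. The conditional normalizing flow variant mentioned in the abstract is not covered by this sketch.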
