Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
This addresses the challenge of data scarcity in domain-specific databases for applications in data management, though it is incremental as it builds on existing LLM and interpretable model techniques.
The paper tackled the problem of generating synthetic tabular data in low-data regimes, where existing methods like GANs and LLMs often fail due to insufficient reference data or poor feature-label dependencies. The proposed ReFine framework achieved up to 0.44 absolute improvement in R-squared for regression and 10.0% relative improvement in F1 score for classification, outperforming state-of-the-art methods.
Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic "if-then" rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks.