Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
This work addresses the need for high-quality synthetic data to improve LLMs without expensive human annotations, though the contribution is incremental because it builds on existing synthetic data generation methods.
The paper tackles the problem of low-quality synthetic data for enhancing large language models by introducing Source2Synth, a scalable method that generates and curates synthetic data grounded in real-world sources. It reports performance improvements of 25.51% for tabular question answering on WikiSQL and 22.57% for multi-hop question answering on HotpotQA compared to the fine-tuned baselines.
Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data examples with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of data: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
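The curation step described in the abstract (discarding generations based on their answerability) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how answerability is measured, so the sketch assumes one plausible reading: an example is kept only if a model can reproduce its reference answer within a few attempts. The names `SyntheticExample`, `is_answerable`, `curate`, `answer_fn`, and `max_tries` are all hypothetical, and `toy_answer_fn` is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    """A generated example: question, reference answer, intermediate reasoning steps."""
    question: str
    answer: str
    reasoning: List[str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_answerable(
    example: SyntheticExample,
    answer_fn: Callable[[str, int], str],
    max_tries: int = 3,
) -> bool:
    """Assumed criterion: the model recovers the reference answer within max_tries attempts."""
    target = normalize(example.answer)
    for attempt in range(max_tries):
        if normalize(answer_fn(example.question, attempt)) == target:
            return True
    return False


def curate(
    examples: List[SyntheticExample],
    answer_fn: Callable[[str, int], str],
    max_tries: int = 3,
) -> List[SyntheticExample]:
    """Keep only the examples that pass the answerability check."""
    return [ex for ex in examples if is_answerable(ex, answer_fn, max_tries)]


def toy_answer_fn(question: str, attempt: int) -> str:
    """Hypothetical stand-in for a model call; a real setup would query an LLM here."""
    known = {"What is the capital of France?": "Paris"}
    return known.get(question, "unknown")


examples = [
    SyntheticExample(
        "What is the capital of France?", "paris", ["France -> capital -> Paris"]
    ),
    SyntheticExample(
        "Which river is longest in the table?", "Nile", ["sort by length", "take first row"]
    ),
]

kept = curate(examples, toy_answer_fn)
print([ex.question for ex in kept])
```

In this toy run only the first example survives, because the stand-in model cannot recover the second reference answer. Exact string match after normalization is a deliberately simple choice. A real pipeline might need a task-specific comparison, for example executing generated SQL for the tabular task.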