CL | Apr 22, 2024

Better Synthetic Data by Retrieving and Transforming Existing Datasets

CMU
arXiv:2404.14361v3 | 54 citations | h-index: 14 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for NLP practitioners by providing a more effective way to repurpose public datasets. It is an incremental advance, since it builds on existing synthetic data generation approaches.

The paper tackles the problem of generating high-quality synthetic training data for NLP tasks. It introduces DataTune, a method that transforms existing datasets to align with target tasks. On BIG-Bench tasks, the authors report a 49% improvement over few-shot prompting and a 34% improvement over existing methods that use synthetic or retrieved data.

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
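The retrieve-then-transform idea in the abstract can be illustrated with a small sketch. This is not the DataTune implementation and does not use the prompt2model API. Every name here (`Dataset`, `overlap_score`, `retrieve`, `transform`, `toy_row_transform`) is hypothetical, the retrieval step is a simple word-overlap heuristic, and the per-row transformation is a hand-written stand-in for what would be a language model call in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Example = Dict[str, str]


@dataclass
class Dataset:
    """Hypothetical container for a public dataset and its description."""
    name: str
    description: str
    rows: List[Example]


def overlap_score(query: str, text: str) -> float:
    """Jaccard word overlap; an assumed stand-in for a real retriever."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(task_description: str, candidates: List[Dataset]) -> Dataset:
    """Pick the candidate whose description best matches the task."""
    return max(
        candidates,
        key=lambda d: overlap_score(task_description, d.description),
    )


def transform(
    dataset: Dataset,
    row_transform: Callable[[Example], Optional[Example]],
) -> List[Example]:
    """Rewrite each row into the target format, dropping rows that fail."""
    transformed = []
    for row in dataset.rows:
        new_row = row_transform(row)
        if new_row is not None:
            transformed.append(new_row)
    return transformed


def toy_row_transform(row: Example) -> Optional[Example]:
    """Assumed placeholder for an LLM-driven rewrite of a single row."""
    if "question" not in row or "answer" not in row:
        return None  # row cannot be mapped to the target schema
    return {"input": f"Q: {row['question']}", "output": row["answer"]}


if __name__ == "__main__":
    candidates = [
        Dataset(
            "trivia_qa_toy",
            "short trivia question answering pairs",
            [{"question": "Capital of France?", "answer": "Paris"}],
        ),
        Dataset(
            "reviews_toy",
            "movie review sentiment labels",
            [{"text": "Great film", "label": "positive"}],
        ),
    ]
    best = retrieve("answer short trivia question", candidates)
    print(best.name, transform(best, toy_row_transform))
```

Keeping the row transformation as a pluggable function means the heuristic could be swapped for a model call without touching the retrieval or filtering logic. The transformed rows would then serve as finetuning data for the target task.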

Code Implementations: 1 repo
