CL · Apr 22, 2024

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

CMU

arXiv:2404.14361v3 · 21.254 citations · h-index: 14 · Has Code · ACL

Originality: Incremental advance

AI Analysis

The paper addresses data scarcity for NLP practitioners by providing a more effective way to repurpose public datasets. The advance is incremental, since it builds on existing synthetic data generation approaches.

The paper tackles the problem of generating high-quality synthetic training data for NLP tasks. It introduces DataTune, a method that transforms existing datasets to align with target tasks. On BIG-Bench tasks, finetuning with DataTune improves over few-shot prompting by 49% and over existing methods that use synthetic or retrieved training data by 34%.

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
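
The retrieve-then-transform idea described in the abstract can be illustrated with a minimal Python sketch. This is not the DataTune implementation or the prompt2model API: the function names (`token_overlap`, `retrieve_dataset`, `transform_dataset`), the word-overlap retrieval score, and the caller-supplied `transform_fn` (standing in for a language-model call) are all assumptions made for illustration.

```python
from typing import Callable, Optional

Example = dict[str, str]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase word sets (a toy relevance score)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def retrieve_dataset(task_description: str, candidates: list[dict]) -> dict:
    """Return the candidate whose description best matches the task.

    Assumption: each candidate is a dict with "description" and "rows" keys.
    """
    return max(
        candidates,
        key=lambda d: token_overlap(task_description, d["description"]),
    )


def transform_dataset(
    rows: list[Example],
    transform_fn: Callable[[Example], Optional[Example]],
) -> list[Example]:
    """Rewrite each retrieved row into the target task's format.

    transform_fn is a hypothetical stand-in for a language-model call.
    It returns a new example, or None when a row cannot be adapted.
    """
    transformed = []
    for row in rows:
        result = transform_fn(row)
        if result is not None:
            transformed.append(result)
    return transformed


if __name__ == "__main__":
    pool = [
        {"description": "movie review sentiment", "rows": []},
        {
            "description": "arithmetic word problems with answers",
            "rows": [{"question": "2 + 3", "answer": "5"}],
        },
    ]
    chosen = retrieve_dataset("solve arithmetic word problems", pool)

    def to_target_format(row: Example) -> Optional[Example]:
        # Toy rule-based rewrite; a real system would prompt a model here.
        if not row["question"]:
            return None
        return {"input": f"Q: {row['question']}", "output": row["answer"]}

    print(transform_dataset(chosen["rows"], to_target_format))
```

Passing the transformation step in as a callable keeps the sketch runnable without any model access, and makes it clear that retrieval and transformation are separate stages. The transformed examples would then serve as finetuning data for the target task.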
