GenQA: Generating Millions of Instructions from a Handful of Prompts
GenQA offers a scalable way for researchers and practitioners to obtain the industrial-scale instruction datasets needed to study finetuning techniques, although its contribution to automated data generation is incremental.
The authors address the problem of generating large-scale instruction datasets for finetuning LLMs by automating the creation of millions of diverse examples from a handful of prompts. When used to finetune a Llama-3 8B model, the resulting dataset matches or outperforms WizardLM and Ultrachat on knowledge and conversational tasks.
Most public instruction finetuning datasets are relatively small compared to the closed-source datasets used to train industry models. To study questions about finetuning at scale, such as curricula and learning rate cooldown schedules, there is a need for industrial-scale datasets. However, this scale necessitates a data generation process that is almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples ranging from simple completion tasks to complex multi-turn dialogs across a variety of subject areas. When finetuning a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks and on conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.
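
The idea of a "generator" prompt that produces many distinct examples can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template wording, the random-index trick used to push outputs apart, the function names, and the `fake_llm` stand-in are hypothetical and are not taken from the released GenQA prompts.

```python
import random
from typing import Callable, Dict, List

# Hypothetical generator prompt; the released GenQA prompts may look different.
GENERATOR_PROMPT = (
    "List {n} diverse topics in {subject}. "
    "Pick topic number {k} from your list, then write one question about it "
    "followed by a detailed answer."
)


def build_prompt(subject: str, rng: random.Random, n: int = 50) -> str:
    """Fill the template with a random index (an assumed diversity trick)."""
    k = rng.randint(1, n)
    return GENERATOR_PROMPT.format(n=n, subject=subject, k=k)


def generate_dataset(
    llm: Callable[[str], str],
    subjects: List[str],
    per_subject: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Call the model once per example; no human writes individual prompts."""
    rng = random.Random(seed)
    rows = []
    for subject in subjects:
        for _ in range(per_subject):
            prompt = build_prompt(subject, rng)
            rows.append(
                {"subject": subject, "prompt": prompt, "response": llm(prompt)}
            )
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[model output for: {prompt[:40]}...]"


dataset = generate_dataset(fake_llm, ["physics", "history"], per_subject=3)
```

In practice `fake_llm` would be replaced by a call to an actual model, and the subject list and template would be scaled up; the point of the sketch is only that a single template plus randomization can drive an otherwise unattended generation loop.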