Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
This work addresses the high cost of expert-curated datasets for code generation and offers a scalable approach to LLM alignment, though the contribution is incremental because it builds on existing synthetic generation approaches.
The paper tackles the problem of generating high-quality coding instruction data for aligning large language models (LLMs). It introduces Genetic-Instruct, a scalable algorithm that uses evolutionary principles to synthesize more than 7.5 million instruction-code pairs. LLMs fine-tuned on this data show significant improvements in code generation compared with other synthetic generation approaches and publicly available datasets.
Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. Our proposed approach is highly parallelizable and effective even with a small amount of seed data and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated the approach by fine-tuning LLMs on the synthetic samples and demonstrated a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
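The pipeline described in the abstract (seed instructions, an Instructor-LLM, a Coder-LLM, and a Judge-LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the evolutionary operators, prompts, or filtering rules, so the function name `genetic_instruct_sketch`, its parameters, the parent-sampling step, and the toy stand-ins for the three models are all assumptions made for illustration. The sketch also runs sequentially, whereas the abstract describes the real method as highly parallelizable.

```python
import random
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]  # (instruction, code)


def genetic_instruct_sketch(
    seeds: List[str],
    instructor: Callable[[List[str]], str],  # hypothetical stand-in for the Instructor-LLM
    coder: Callable[[str], str],             # hypothetical stand-in for the Coder-LLM
    judge: Callable[[str, str], bool],       # hypothetical stand-in for the Judge-LLM
    target_size: int,
    parents_per_step: int = 2,               # assumed value, not taken from the paper
    max_steps: int = 1000,
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Grow a pool of instruction-code pairs from a small seed set."""
    if not seeds:
        raise ValueError("at least one seed instruction is required")
    rng = rng or random.Random(0)
    population = list(seeds)   # instructions that may act as parents
    seen = set(population)     # used to skip exact duplicates
    accepted: List[Sample] = []

    for _ in range(max_steps):
        if len(accepted) >= target_size:
            break
        # Pick parent instructions from the current population.
        k = min(parents_per_step, len(population))
        parents = rng.sample(population, k)
        # Derive a new instruction from the parents.
        instruction = instructor(parents)
        if instruction in seen:
            continue
        # Synthesize code for the instruction, then let the judge filter it.
        code = coder(instruction)
        if judge(instruction, code):
            seen.add(instruction)
            population.append(instruction)  # accepted offspring can become parents
            accepted.append((instruction, code))
    return accepted


# Toy stand-ins so the sketch runs without any model; real use would call LLMs here.
def toy_instructor(parents: List[str]) -> str:
    return ", then ".join(parents)


def toy_coder(instruction: str) -> str:
    return f"# TODO: {instruction}\n"


def toy_judge(instruction: str, code: str) -> bool:
    # Placeholder quality check: non-empty code and a bounded instruction length.
    return bool(code.strip()) and len(instruction) <= 200


demo_seeds = ["reverse a string", "sum a list", "parse a date"]
demo_samples = genetic_instruct_sketch(
    demo_seeds, toy_instructor, toy_coder, toy_judge, target_size=5
)
```

Passing the three roles as plain callables keeps the loop independent of any particular model or API. Feeding accepted instructions back into the parent pool is one way to read "evolutionary principles"; the paper itself should be consulted for the actual operators and filtering criteria.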