FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only
FANNO addresses the cost of data annotation for AI researchers and practitioners by offering a free alternative, though the contribution is incremental because it builds on existing instruction fine-tuning methods.
The paper tackles the high cost and labor of annotating instruction datasets for fine-tuning large language models. It introduces FANNO, an autonomous open-source framework that generates diverse, high-quality data; experiments on the Open LLM Leaderboard and AlpacaEval benchmarks show that this data is comparable to human-annotated or cleaned datasets such as Alpaca-GPT4-Cleaned.
Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls to proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and AlpacaEval benchmarks show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.
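The three stages named in the abstract (document pre-screening, instruction generation, response generation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `prescreen`, `build_dataset`, and `echo_model`, the prompt wording, and the length-and-deduplication filter are all hypothetical, and the `generate` callable stands in for whatever open-source instruct model is actually used.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: any function that maps a
# prompt string to a completion string can be plugged in here.
Generate = Callable[[str], str]


def prescreen(documents: List[str], min_words: int = 5) -> List[str]:
    """Keep non-duplicate documents with at least `min_words` words.

    This length-and-dedup filter is an illustrative assumption, not the
    paper's actual screening criterion.
    """
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        if len(text.split()) >= min_words and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def build_dataset(documents: List[str], generate: Generate) -> List[Dict[str, str]]:
    """Pre-screen documents, then generate one instruction and one response each."""
    dataset = []
    for doc in prescreen(documents):
        # Prompt wording is assumed for illustration only.
        instruction = generate(f"Write one instruction grounded in this text:\n{doc}")
        response = generate(f"Respond to the following instruction:\n{instruction}")
        dataset.append({"instruction": instruction, "response": response})
    return dataset


def echo_model(prompt: str) -> str:
    """Toy generator for demonstration: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


if __name__ == "__main__":
    docs = [
        "too short",
        "Water boils at 100 degrees Celsius at sea level.",
        "Water boils at 100 degrees Celsius at sea level.",
    ]
    for pair in build_dataset(docs, echo_model):
        print(pair)
```

Passing the generator in as a plain callable keeps the sketch independent of any particular model library; a real run would replace `echo_model` with a wrapper around an instruct model and a screening rule closer to the one the paper describes.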