CL · AI · Dec 19, 2024

Language Models as Continuous Self-Evolving Data Engineers

arXiv:2412.15151v3 · 5 citations · h-index: 24 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the reliance on expert-labeled data for LLM training, potentially reducing time and cost while aligning data with human preferences, though it appears to be an incremental advance that automates existing data engineering processes.

The paper tackles the problem of limited high-quality training data for large language models by proposing LANCE, a paradigm where LLMs autonomously generate, clean, review, and annotate data, resulting in average score enhancements of 3.64 for Qwen2-7B and 1.75 for Qwen2-7B-Instruct across benchmarks.

Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, but their further evolution is limited by the lack of high-quality training data. In addition, traditional training approaches rely too heavily on expert-labeled data, setting a ceiling on the performance of LLMs. To address this issue, we propose a novel paradigm named LANCE (LANguage models as Continuous self-Evolving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning on Qwen2 series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE yields an average score improvement of 3.64 for Qwen2-7B and 1.75 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities. Code is available at: https://github.com/Control-derek/LANCE.
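The generate, clean, review, and annotate cycle described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the LANCE implementation: the names `PreferencePair`, `build_preference_data`, and `self_evolve` are hypothetical, the model-backed stages are passed in as plain callables, and the choice to pair the highest-scored response with the lowest-scored one is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class PreferencePair:
    """One preference-annotated example (hypothetical structure)."""
    prompt: str
    chosen: str    # response with the higher review score
    rejected: str  # response with the lower review score


def build_preference_data(
    seeds: List[str],
    generate: Callable[[str], List[str]],  # prompt -> candidate responses
    is_clean: Callable[[str], bool],       # cleaning filter
    score: Callable[[str, str], float],    # (prompt, response) -> review score
) -> List[PreferencePair]:
    """Generate, clean, review, and annotate candidates for each seed prompt."""
    pairs: List[PreferencePair] = []
    for prompt in seeds:
        # Generation followed by cleaning.
        candidates = [r for r in generate(prompt) if is_clean(r)]
        if len(candidates) < 2:
            continue  # a preference pair needs two distinct responses
        # Review: score every surviving candidate once.
        scored = sorted((score(prompt, r), r) for r in candidates)
        (low_score, worst), (high_score, best) = scored[0], scored[-1]
        # Annotation: keep the pair only if the scores actually differ.
        if high_score > low_score:
            pairs.append(PreferencePair(prompt, best, worst))
    return pairs


def self_evolve(
    model: Any,
    seeds: List[str],
    rounds: int,
    make_stages: Callable[[Any], Tuple[Callable, Callable, Callable]],
    fine_tune: Callable[[Any, List[PreferencePair]], Any],
) -> Any:
    """Iterate: the current model builds data, then is fine-tuned on it."""
    for _ in range(rounds):
        generate, is_clean, score = make_stages(model)
        pairs = build_preference_data(seeds, generate, is_clean, score)
        model = fine_tune(model, pairs)
    return model
```

In a real setting, `make_stages` would wrap calls to the current model checkpoint and `fine_tune` would run a preference-optimization step; both are left abstract here because the abstract does not specify them.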

