CLDec 6, 2024

Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang

arXiv:2412.04871v112.619 citationsh-index: 4Has CodeCOLING

Originality Incremental advance

AI Analysis

This addresses the problem of expensive dataset preparation for researchers and practitioners specializing LLMs, though it is incremental as it builds on existing data augmentation and distillation techniques.

The paper tackles the high cost of dataset construction for domain-specific LLM fine-tuning by introducing a family of data augmentation models that support instruction expansion, refinement, and pair expansion, reducing inference costs and improving efficiency.

Specializing LLMs in various domain-specific tasks has emerged as a critical step towards achieving high performance. However, the construction and annotation of datasets in specific domains are always very costly. Apart from using superior and expensive closed-source LLM APIs to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. Thus, we present a family of data augmentation models designed to significantly improve the efficiency for model fine-tuning. These models, trained based on sufficiently small LLMs, support key functionalities with low inference costs: instruction expansion, instruction refinement, and instruction-response pair expansion. To fulfill this goal, we first construct an automatic data collection system with seed datasets generated from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine and re-write the instructions and responses, incorporating quality assessment techniques. Following this, we introduce the training process of our models, which effectively distills task-solving and text synthesis abilities from teacher LLMs. Finally, we demonstrate how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning from both dataset preparation and training perspectives for users. Experiments and an application study prove the effectiveness of our approach.

View on arXiv PDF

Similar