CL, AI · Dec 13, 2024

Small Language Model as Data Prospector for Large Language Model

Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, Min Yang

arXiv:2412.09990v13.43 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive data selection process for LLM fine-tuning, offering a more efficient solution for practitioners, though it is incremental as it builds directly on prior work.

The paper tackles the problem of efficiently selecting high-quality instruction data for fine-tuning Large Language Models (LLMs). It proposes SuperNUGGETS, an improved variant of NUGGETS that uses a small language model (SLM) for filtering, which costs only a 1-2% drop in performance while running 58x more efficiently than the original method.

The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, Li et al. (2023) proposed NUGGETS, which selects high-quality data from a large dataset by identifying the individual instruction examples that significantly improve performance on different tasks after being learnt as one-shot instances. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined set of tests. The experimental results show that the performance of SuperNUGGETS decreases by only 1-2% compared to NUGGETS, while efficiency increases by a factor of 58. Compared to the original NUGGETS, SuperNUGGETS therefore has higher utility value, owing to its significantly lower resource consumption.
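The one-shot filtering idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not in the abstract: each candidate is scored by the fraction of predefined test tasks whose score improves when the candidate is prepended as a one-shot demonstration, and the top-scoring fraction is kept. The names `golden_score`, `select_top`, `toy_score`, and `keep_ratio` are hypothetical, and `toy_score` is a word-overlap stand-in for a real language model. In this framing, the `score` callable is where a small model would take the place of a large one.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer signature: returns a quality score for `task` (higher is
# better). `one_shot` is None for the zero-shot case.
ScoreFn = Callable[[str, Optional[str]], float]


def golden_score(candidate: str, tasks: Sequence[str], score: ScoreFn) -> float:
    """Fraction of test tasks whose score improves when `candidate` is used
    as a one-shot demonstration (an assumed scoring rule, for illustration)."""
    if not tasks:
        raise ValueError("the predefined task set must not be empty")
    improved = sum(1 for t in tasks if score(t, candidate) > score(t, None))
    return improved / len(tasks)


def select_top(candidates: Sequence[str], tasks: Sequence[str],
               score: ScoreFn, keep_ratio: float = 0.1) -> List[str]:
    """Rank candidates by golden score and keep the top `keep_ratio` share
    (always at least one)."""
    ranked = sorted(candidates,
                    key=lambda c: golden_score(c, tasks, score),
                    reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]


def toy_score(task: str, one_shot: Optional[str]) -> float:
    """Toy stand-in for a model-based scorer: counts words shared between
    the task and the demonstration. Zero-shot always scores 0."""
    if one_shot is None:
        return 0.0
    return float(len(set(task.lower().split()) & set(one_shot.lower().split())))


TASKS = ["translate sentence", "summarize article", "sort numbers"]
CANDIDATES = ["hello world", "sort", "translate and summarize"]

if __name__ == "__main__":
    for c in CANDIDATES:
        print(c, golden_score(c, TASKS, toy_score))
    print(select_top(CANDIDATES, TASKS, toy_score, keep_ratio=0.5))
```

Under this framing the cost is dominated by calls to `score`, one zero-shot and one one-shot call per candidate and task, so a cheaper scorer reduces the cost of the whole selection pass.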
