LG | Dec 31, 2025

Efficiently Estimating Data Efficiency for Language Model Fine-tuning

arXiv:2512.24991v1 | 1 citation | h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the costly cycle of incremental annotation and retraining when fine-tuning language models. Its contribution is an incremental advance, since it builds on existing fine-tuning practices.

The paper tackles the problem of predicting how many fine-tuning examples a language model needs to reach a desired level of performance. It proposes using the gradient cosine similarity of low-confidence examples to estimate data efficiency, reaching 8.6% error in that estimate and saving hundreds of annotations per task.

While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task's data efficiency (i.e., the number of fine-tuning examples needed to achieve a desired level of performance) is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task's data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task's data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. We validate our approach on a diverse set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and typically eliminating hundreds of unnecessary annotations on each task. Our experimental results and implementation code are available on GitHub.
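
To make the ingredients of the proposed signal concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy logistic-regression model in place of an LLM, a hypothetical confidence threshold of 0.7, and a simple mean of pairwise cosine similarities as the aggregate. The abstract does not specify how the similarity is aggregated or how it is mapped to the data-efficiency metric, so those steps are left out. All function names are illustrative.

```python
import math
from itertools import combinations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Probability of class 1 under a toy logistic model (stand-in for an LLM)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def confidence(w, x):
    """Probability the model assigns to its own predicted class."""
    p = predict_proba(w, x)
    return max(p, 1.0 - p)


def per_example_gradient(w, x, y):
    """Gradient of the logistic loss w.r.t. w for one labeled example, y in {0, 1}."""
    p = predict_proba(w, x)
    return [(p - y) * xi for xi in x]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def low_confidence_gradient_similarity(w, examples, threshold=0.7):
    """Mean pairwise gradient cosine similarity over low-confidence examples.

    `threshold` is a hypothetical cutoff chosen for illustration.
    Returns None when fewer than two examples fall below it.
    """
    low = [(x, y) for x, y in examples if confidence(w, x) < threshold]
    grads = [per_example_gradient(w, x, y) for x, y in low]
    pairs = list(combinations(grads, 2))
    if not pairs:
        return None
    return sum(cosine(g, h) for g, h in pairs) / len(pairs)


# A handful of labeled samples: (features, label). The last one is
# predicted confidently, so it is excluded from the similarity.
w = [0.1, -0.1]
examples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([5.0, -5.0], 1),
]
score = low_confidence_gradient_similarity(w, examples)
print(score)
```

The intuition behind such a score is that when gradients of uncertain examples point in similar directions, a single update helps many of them at once. How this quantity relates to the number of annotations needed is the paper's contribution and is not reproduced here.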

