CL | May 18, 2025

Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning

arXiv:2505.12212v3 | 24 citations | h-index: 11 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of resource-intensive data selection for LLM fine-tuning and offers practitioners a more efficient solution. The advance is incremental, as it builds on existing attention-based and in-context learning techniques.

The paper tackles the problem of efficiently selecting data subsets for fine-tuning large language models (LLMs) so as to balance performance and computational cost. It proposes Data Whisperer, which, using just 10% of the data, outperforms training on the full GSM8K dataset with Llama-3-8B-Instruct, and which reports a 3.1-point improvement and a 7.4x speedup over existing selection methods.

Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup. The code is available at https://github.com/gszfwsb/Data-Whisperer.
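
The abstract describes the method only at a high level (training-free, attention-based, few-shot in-context learning), so the following is a rough sketch of the general idea and not the authors' algorithm. It assumes one plausible reading: repeatedly sample a few examples as in-context demonstrations and a few as queries, measure how well the model answers the queries, and credit each demonstration with a share of that score proportional to its attention weight. The names `select_subset`, `score_fn`, and `toy_score_fn` are hypothetical, and `toy_score_fn` is a stand-in for a real LLM call.

```python
import random
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper):
# (demo indices, query indices) -> (task score, one attention weight per demo)
ScoreFn = Callable[[Sequence[int], Sequence[int]], Tuple[float, Sequence[float]]]


def select_subset(
    n_examples: int,
    score_fn: ScoreFn,
    n_demos: int = 4,
    n_queries: int = 2,
    n_rounds: int = 200,
    keep_fraction: float = 0.1,
    seed: int = 0,
) -> List[int]:
    """Rank examples by their average attention-weighted ICL credit."""
    rng = random.Random(seed)
    totals = defaultdict(float)  # summed credit per example index
    counts = defaultdict(int)    # number of times each example was a demo

    for _ in range(n_rounds):
        # Draw disjoint demonstration and query sets from the dataset.
        picked = rng.sample(range(n_examples), n_demos + n_queries)
        demos, queries = picked[:n_demos], picked[n_demos:]

        score, weights = score_fn(demos, queries)
        norm = sum(weights) or 1.0  # guard against all-zero weights

        # Split the round's score among demos by normalized attention weight.
        for idx, w in zip(demos, weights):
            totals[idx] += score * (w / norm)
            counts[idx] += 1

    avg = {i: totals[i] / counts[i] for i in counts}
    # Examples never sampled as a demo default to zero credit.
    ranked = sorted(range(n_examples), key=lambda i: avg.get(i, 0.0), reverse=True)
    k = max(1, int(n_examples * keep_fraction))
    return ranked[:k]


def toy_score_fn(demos: Sequence[int], queries: Sequence[int]) -> Tuple[float, List[float]]:
    """Toy stand-in for the model: examples 0..9 are 'useful', the rest are not."""
    quality = [1.0 if i < 10 else 0.1 for i in demos]
    score = sum(quality) / len(quality)  # pretend task score on the queries
    return score, quality                # pretend attention mirrors usefulness


selected = select_subset(100, toy_score_fn, n_rounds=2000)
print(sorted(selected))
```

In a real setting, `score_fn` would build a few-shot prompt from the demonstrations, run the model that is about to be fine-tuned on the queries, and read attention weights from the model; how those weights are aggregated across layers and heads is left open here because the abstract does not specify it.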

Code Implementations: 1 repo
