CL · AI · Sep 17, 2024

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Stanford
arXiv:2409.11378v1 · 13 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficient data selection for fine-tuning LLMs. Its approach improves performance across multiple domains, though the advance is incremental relative to existing sampling methods.

The paper tackles the problem of selecting data subsets for fine-tuning large language models. It proposes a diversity-centric method that combines k-means clustering with iterative refinement, and reports a 7% improvement over random selection and a 3.8% gain over state-of-the-art sampling methods across various tasks.

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
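The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes each instruction is already embedded as a small numeric tuple, uses a plain Lloyd's k-means written with the standard library, and substitutes a hypothetical `score_fn` for whatever quality signal a real pipeline would obtain after each training round. The function names, the multiplicative weight update, and the `1e-6` weight floor are all illustrative assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

def sample_by_cluster(labels, weights, budget, rng, exclude=()):
    """Pick a cluster by weight, then one unused member uniformly, `budget` times."""
    pools = {}
    for i, l in enumerate(labels):
        if i not in exclude:
            pools.setdefault(l, []).append(i)
    chosen = []
    while len(chosen) < budget and pools:
        clusters = sorted(pools)
        c = rng.choices(clusters, weights=[weights[x] for x in clusters])[0]
        chosen.append(pools[c].pop(rng.randrange(len(pools[c]))))
        if not pools[c]:
            del pools[c]
    return chosen

def iterative_select(points, k, budget, rounds, score_fn, seed=0):
    """Cluster once, then resample in rounds, reweighting clusters each round."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    weights = {c: 1.0 / k for c in range(k)}
    selected = []
    for _ in range(rounds):
        new = sample_by_cluster(labels, weights, budget // rounds, rng,
                                exclude=set(selected))
        selected.extend(new)
        # A real pipeline would fine-tune on `selected` here; score_fn is a
        # hypothetical stand-in for the resulting per-example quality signal.
        for c in range(k):
            scores = [score_fn(i) for i in new if labels[i] == c]
            if scores:  # assumed update rule: scale by mean score, with a floor
                weights[c] = max(1e-6, weights[c] * sum(scores) / len(scores))
        total = sum(weights.values())
        weights = {c: w / total for c, w in weights.items()}
    return selected, weights
```

Calling `iterative_select(points, k=3, budget=30, rounds=3, score_fn=...)` returns the chosen indices and the final cluster weights. Clusters whose sampled members score poorly see their weight shrink from round to round, which is the mechanism the abstract credits with filtering out low-quality clusters.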

Code Implementations: 1 repo
