CL | May 30, 2025

From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

Haoyu Li, Xuhong Li, Yiming Dong, Kun Liu

arXiv:2505.24768v1 | 2 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses dataset construction for language model developers and offers actionable insights. It is an incremental advance: it builds on existing diversity concepts and adds a novel analysis of diversity within responses.

The paper tackles dataset diversity in language model fine-tuning by systematically analyzing diversity-control strategies at the macro, meso, and micro levels. It finds that microscopic diversity in responses correlates most strongly with model performance and, at maximum diversity, yields the best results among all strategies tested.

Dataset diversity plays a pivotal role in the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, dataset diversity has rarely been analyzed systematically. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component and operate at either the macroscopic level (entire instruction semantics) or the mesoscopic level (instruction units). It furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, applying six distinct diversity-control strategies that span the macro-, meso-, and microscopic levels and cover both instructions and responses. We then fine-tune LLMs on these datasets to assess the six strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance as diversity increases, the microscopic strategy applied to responses exhibits both a stronger correlation between model performance and the degree of diversity and, at maximum diversity, the best performance across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.
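The abstract describes microscopic diversity as a property of the token distribution in responses but does not state the exact measure. As a rough illustration only, the sketch below uses Shannon entropy over unigram token counts as one possible proxy. The function name `token_entropy`, the whitespace tokenizer, and the toy response lists are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def token_entropy(responses):
    """Shannon entropy (in bits) of the unigram token distribution
    pooled over all responses.

    Assumption: tokens are lowercase whitespace-separated words, a
    stand-in for whatever tokenizer a real pipeline would use.
    """
    counts = Counter(tok for r in responses for tok in r.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: repeated responses vs. varied responses.
low_diversity = ["the cat sat", "the cat sat", "the cat sat"]
high_diversity = ["the cat sat", "a dog ran", "birds fly south"]

print(token_entropy(low_diversity))
print(token_entropy(high_diversity))
```

Under this proxy, a response set whose tokens are spread more evenly across a larger vocabulary scores higher, which is the direction of "more microscopic diversity" in the loose sense used here.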
