CL | Jun 30, 2025

TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

arXiv:2506.23979v12.71 citations | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of limited, English-centric preference datasets for LLM fine-tuning and offers a scalable solution for researchers and practitioners, though its advance in automating dataset generation is incremental.

The paper tackles the resource-intensive challenge of constructing high-quality preference datasets for fine-tuning large language models (LLMs) across languages. It proposes the TaP framework, which automates and scales dataset generation. Results show that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets, including one that is 180 times larger.

Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
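The abstract does not describe TaP's taxonomy or pipeline in detail, so the following is only a minimal sketch of how a taxonomy could give fine-grained control over dataset composition. Everything in it is an assumption made for illustration: the two-level `TAXONOMY`, the language list, the `PreferenceRecord` fields, and the helper names `plan_composition` and `build_generation_prompt` are hypothetical and are not taken from the paper.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical two-level taxonomy (assumption; the paper's taxonomy is not shown here).
TAXONOMY = {
    "mathematics": ["algebra", "geometry"],
    "writing": ["summarization", "email drafting"],
    "safety": ["privacy", "medical advice"],
}


@dataclass
class PreferenceRecord:
    """One preference example: a prompt with a preferred and a dispreferred response."""
    category: str
    subcategory: str
    language: str
    prompt: str
    chosen: str
    rejected: str


def leaf_nodes(taxonomy):
    """Flatten the taxonomy into (category, subcategory) pairs."""
    return [(cat, sub) for cat, subs in taxonomy.items() for sub in subs]


def plan_composition(taxonomy, languages, per_leaf):
    """Allocate `per_leaf` slots to every (leaf, language) pair for even coverage."""
    slots = []
    for (cat, sub), lang in itertools.product(leaf_nodes(taxonomy), languages):
        slots.extend([(cat, sub, lang)] * per_leaf)
    return slots


def build_generation_prompt(category, subcategory, language):
    """Build the text that would be sent to a generator LLM (the call itself is omitted)."""
    return f"Write one user instruction in {language} about {category} / {subcategory}."


# Languages are illustrative assumptions.
slots = plan_composition(TAXONOMY, ["English", "Chinese"], per_leaf=2)
random.Random(0).shuffle(slots)  # fixed seed so the ordering is reproducible

for category, subcategory, language in slots[:3]:
    print(build_generation_prompt(category, subcategory, language))
```

Because every slot is tied to a taxonomy leaf and a language, the composition of the resulting dataset can be adjusted by changing `per_leaf` or by weighting individual leaves, rather than by filtering after generation.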
