Less is More: Improving LLM Alignment via Preference Data Selection
This work addresses data efficiency and noise issues in preference optimization for LLM alignment, offering incremental improvements over existing DPO methods.
The paper tackles the problem of noisy data in Direct Preference Optimization (DPO) for aligning large language models by proposing a data selection strategy based on margin-maximization and Bayesian aggregation, achieving 3% to 8% improvements on the AlpacaEval2 benchmark using only 10% of the Ultrafeedback dataset.
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.
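To make the selection idea concrete, here is a minimal sketch of margin-based filtering with several margin sources. The abstract does not give the paper's aggregation formula, so everything below is an assumption for illustration: each margin is treated as an independent Bradley-Terry style vote with a uniform prior, which makes the combined log-odds the sum of the scaled margins. The names `aggregate_preference`, `select_top_fraction`, and the `margins` field are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def aggregate_preference(margins, scales=None):
    """Combine several margins into one P(chosen beats rejected).

    Assumption (not taken from the paper): each source i reports an
    independent probability sigmoid(scale_i * margin_i) and the prior is
    uniform, so the posterior odds are the product of the per-source odds,
    i.e. the posterior log-odds are the sum of the scaled margins.
    """
    if scales is None:
        scales = [1.0] * len(margins)
    log_odds = sum(s * m for s, m in zip(scales, margins))
    return sigmoid(log_odds)


def select_top_fraction(pairs, fraction=0.1):
    """Keep the fraction of pairs with the highest aggregated probability."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(pairs)))
    ranked = sorted(
        pairs,
        key=lambda p: aggregate_preference(p["margins"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: each pair carries [external reward margin, implicit DPO margin].
pairs = [
    {"id": "a", "margins": [2.0, 1.0]},
    {"id": "b", "margins": [-1.0, 0.5]},
    {"id": "c", "margins": [0.1, 0.1]},
    {"id": "d", "margins": [3.0, -0.5]},
]
selected = select_top_fraction(pairs, fraction=0.5)
selected_ids = [p["id"] for p in selected]
```

In this toy run the two pairs with the largest combined margin are kept. Pair "b", whose two sources disagree, ends up with the lowest score, which is the kind of noisy example a margin-based filter is meant to drop.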