LG | AI | May 25, 2025

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

arXiv:2505.19241v1 | 8 citations | h-index: 39
Originality: Incremental advance
AI Analysis

The paper addresses the resource-intensive challenge of collecting high-quality preference datasets for LLM alignment, offering an incremental improvement in the efficiency of active data selection.

The paper tackles the cost of human preference annotation for aligning large language models. It proposes ActiveDPO, an algorithm that applies a theoretically grounded data selection criterion for non-linear reward functions. According to the paper, this makes data collection more effective and efficient, and in its experiments ActiveDPO outperforms existing methods.

The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
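The idea of letting the LLM itself parameterize the reward model can be illustrated with a small sketch. It uses the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), together with a Bradley-Terry preference probability. The abstract does not state ActiveDPO's selection criterion, so the rule below (pick the pairs whose predicted preference is closest to 0.5) is a stand-in heuristic chosen only for illustration, not the paper's method. All function names, dictionary keys, and log-probability values are hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def preference_prob(reward_a, reward_b):
    """Bradley-Terry probability that response a is preferred over response b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def select_for_annotation(candidates, budget, beta=0.1):
    """Return the `budget` pairs the current policy is least sure about.

    ASSUMPTION: distance of the predicted preference from 0.5 is used as
    the uncertainty score. This is an illustrative heuristic, not the
    selection criterion proposed in the ActiveDPO paper.
    """
    def distance_from_tie(c):
        r_a = implicit_reward(c["logp_a"], c["ref_logp_a"], beta)
        r_b = implicit_reward(c["logp_b"], c["ref_logp_b"], beta)
        return abs(preference_prob(r_a, r_b) - 0.5)

    # Smallest distance from a tie means highest uncertainty.
    return sorted(candidates, key=distance_from_tie)[:budget]


# Hypothetical sequence log-probabilities for three (prompt, response pair) items.
pool = [
    {"id": "p1", "logp_a": -12.0, "ref_logp_a": -12.5, "logp_b": -30.0, "ref_logp_b": -14.0},
    {"id": "p2", "logp_a": -20.0, "ref_logp_a": -20.1, "logp_b": -19.0, "ref_logp_b": -19.0},
    {"id": "p3", "logp_a": -8.0, "ref_logp_a": -15.0, "logp_b": -9.0, "ref_logp_b": -9.5},
]
selected = select_for_annotation(pool, budget=2)
print([c["id"] for c in selected])
```

Because the rewards are computed from the policy's own log-probabilities, the ranking changes as the policy is updated, which is the sense in which selection depends on the LLM being aligned. In practice the log-probabilities would come from forward passes of the policy and a frozen reference model.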

