CL · Jun 25, 2024

Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

arXiv:2406.17312v2 · 29 citations
Originality: Incremental advance
AI Analysis

This work targets annotation efficiency for researchers and practitioners who use iterative preference learning. The contribution is incremental because it builds on existing methods such as DPO.

The paper tackles the high annotation cost of iterative preference learning. It develops a strategy for selecting the response pairs most worth annotating, and reports performance competitive with or better than random selection at a lower annotation cost.

Iterative preference learning, though yielding superior performances, requires online annotated preference labels. In this work, we study strategies to select worth-annotating response pairs for cost-efficient annotation while achieving competitive or even better performances compared with the random selection baseline for iterative preference learning. Built on assumptions regarding uncertainty and distribution shifts, we propose a comparative view to rank the implicit reward margins as predicted by DPO to select the response pairs that yield more benefits. Through extensive experiments, we show that annotating those response pairs with small margins is generally better than large or random, under both single- and multi-iteration scenarios. Besides, our empirical results suggest allocating more annotation budgets in the earlier iterations rather than later across multiple iterations.
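
The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard DPO implicit reward, r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)), and it assumes the margin of an unlabeled pair is the absolute difference between the two implicit rewards. The `ResponsePair` class, its field names, the function names, and the default `beta` are all hypothetical, and the log-probabilities are assumed to be precomputed sums over response tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponsePair:
    """Two candidate responses to one prompt (hypothetical container).

    Each *_logp field is assumed to hold the summed token log-probability
    of a response under the current policy or the frozen reference model.
    """
    prompt: str
    policy_logp_a: float
    ref_logp_a: float
    policy_logp_b: float
    ref_logp_b: float


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    # DPO implicit reward, up to a prompt-dependent constant that cancels
    # when two responses to the same prompt are compared.
    return beta * (policy_logp - ref_logp)


def reward_margin(pair: ResponsePair, beta: float = 0.1) -> float:
    # Assumption: with no preference label yet, the margin is the absolute
    # gap between the two implicit rewards.
    r_a = implicit_reward(pair.policy_logp_a, pair.ref_logp_a, beta)
    r_b = implicit_reward(pair.policy_logp_b, pair.ref_logp_b, beta)
    return abs(r_a - r_b)


def select_small_margin(pairs: List[ResponsePair], budget: int, beta: float = 0.1) -> List[ResponsePair]:
    # Rank pairs by ascending margin and keep the `budget` smallest ones;
    # these are the pairs that would be sent to annotators.
    ranked = sorted(pairs, key=lambda p: reward_margin(p, beta))
    return ranked[:budget]
```

Sorting in descending order instead would give the large-margin variant that the abstract compares against, and sampling `budget` pairs uniformly would give the random baseline.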
