LG AI IT OC ML | Dec 27, 2024

Comparing Few to Rank Many: Active Human Preference Learning using Randomized Frank-Wolfe

Kiran Koshy Thekumparampil, Gaurush Hiranandani, Kousha Kalantari, Shoham Sabach, Branislav Kveton

arXiv:2412.19396v1 | 9.25 citations | h-index: 37 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficiently eliciting human preferences in applications like reinforcement learning from human feedback, though it appears incremental as it builds on existing D-optimal design and Frank-Wolfe methods.

The paper tackles the problem of learning human preferences from limited comparison feedback. It formulates the task as learning a Plackett-Luce model over N choices from K-way comparisons, where K is much smaller than N. To solve the D-optimal design for this objective efficiently, it proposes a randomized Frank-Wolfe algorithm, which it evaluates empirically on synthetic and NLP datasets.

We study learning of human preferences from limited comparison feedback. This task is ubiquitous in machine learning, and its applications, such as reinforcement learning from human feedback, have been transformational. We formulate this problem as learning a Plackett-Luce model over a universe of $N$ choices from $K$-way comparison feedback, where typically $K \ll N$. Our solution is the D-optimal design for the Plackett-Luce objective. The design defines a data logging policy that elicits comparison feedback for a small collection of optimally chosen points from all ${N \choose K}$ feasible subsets. The main algorithmic challenge in this work is that even fast methods for solving D-optimal designs would have $O({N \choose K})$ time complexity. To address this issue, we propose a randomized Frank-Wolfe (FW) algorithm that solves the linear maximization sub-problems in the FW method on randomly chosen variables. We analyze the algorithm and evaluate it empirically on synthetic and open-source NLP datasets.
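
The idea of restricting each Frank-Wolfe linear maximization step to randomly chosen variables can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration only: each item is given a feature vector (the rows of `X`), the information matrix of a subset is taken to be the sum of outer products of pairwise feature differences, a small ridge term `reg` keeps the matrix invertible, and the step size follows the generic open-loop schedule `2 / (t + 3)`. All function names and parameters are hypothetical.

```python
import itertools
import numpy as np


def subset_information(X, subset):
    """Assumed information matrix of one K-subset:
    sum of outer products of pairwise feature differences."""
    d = X.shape[1]
    A = np.zeros((d, d))
    for i, j in itertools.combinations(subset, 2):
        z = X[i] - X[j]
        A += np.outer(z, z)
    return A


def log_det_design(X, design, reg=1e-3):
    """D-optimality objective: log det of the regularized design matrix."""
    d = X.shape[1]
    V = reg * np.eye(d)
    for S, w in design.items():
        V += w * subset_information(X, S)
    return np.linalg.slogdet(V)[1]


def sample_subset(rng, N, K):
    """Draw one K-subset of {0, ..., N-1} uniformly at random."""
    return tuple(sorted(int(i) for i in rng.choice(N, size=K, replace=False)))


def randomized_frank_wolfe(X, K, n_iters=200, n_candidates=20, reg=1e-3, seed=0):
    """Return a sparse design {subset: probability} over K-subsets of items."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    first = sample_subset(rng, N, K)
    design = {first: 1.0}
    M = subset_information(X, first)  # weighted sum of subset matrices
    for t in range(n_iters):
        V_inv = np.linalg.inv(M + reg * np.eye(d))
        # Linear maximization over a random sample of subsets only,
        # instead of scanning all (N choose K) of them.
        candidates = [sample_subset(rng, N, K) for _ in range(n_candidates)]
        infos = [subset_information(X, S) for S in candidates]
        # Partial derivative of log det w.r.t. a subset's mass: trace(V^{-1} A_S).
        grads = [np.trace(V_inv @ A) for A in infos]
        best = int(np.argmax(grads))
        gamma = 2.0 / (t + 3)  # assumed open-loop step size
        for S in design:
            design[S] *= 1.0 - gamma
        design[candidates[best]] = design.get(candidates[best], 0.0) + gamma
        M = (1.0 - gamma) * M + gamma * infos[best]
    return design


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(30, 4))
    design = randomized_frank_wolfe(X, K=3)
    print(len(design), "subsets in the support; log det =", log_det_design(X, design))
```

Each iteration costs time proportional to `n_candidates` rather than to the number of feasible subsets, which is the point of the randomization. The returned dictionary can be read as a logging policy: sample a subset with probability equal to its mass and ask for a comparison over its items.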
