LGAICLFeb 16, 2024

Active Preference Optimization for Sample Efficient RLHF

arXiv:2402.10500v362 citationsh-index: 19ECML/PKDD
Originality Incremental advance
AI Analysis

This addresses a critical bottleneck in deploying aligned LLMs by reducing data collection costs, though it is an incremental improvement over existing methods.

The paper tackles the problem of costly human preference collection in RLHF for LLM alignment by showing that uniform context sampling leads to constant sub-optimality, and proposes an active learning algorithm (APO) that achieves a sub-optimality gap matching a lower bound up to log factors, with experiments validating its sample efficiency.

Large Language Models (LLMs) aligned using Reinforcement Learning from Human Feedback (RLHF) have shown remarkable generation abilities in numerous tasks. However, collecting high-quality human preferences creates costly bottlenecks in practical deployments, and hence, training data are often budgeted. In these scenarios, it is crucial to collect training data (e.g., contexts, a pair of generations for each context, and a preference indicating which generation is better) carefully, yet most of the existing methods sample contexts uniformly at random from a given collection. Given this, under the Bradley-Terry-Luce preference model and with a small budget of training data, we show that uniform sampling of contexts could lead to a policy (i.e., an aligned model) that suffers a constant sub-optimality gap from the optimal policy. This highlights the need for an adaptive context sampling strategy for effective alignment under a small sample budget. To address this, we reformulate RLHF within the contextual preference bandit framework, treating generations as actions, and give a nearly complete characterization of the sub-optimality gap in terms of both lower and upper bounds. First, when the action set is a $d$-dimensional hypercube and the number of samples is $T$, we show an $Ω(d/\sqrt{T})$ lower bound. Next, we propose an algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), that iteratively collects preferences for the most uncertain contexts. We show that the sub-optimality gap of the policy learned via $\texttt{APO}$ matches the lower bound up to a log factor and a non-linearity constant. Finally, we perform experiments on practical datasets to validate $\texttt{APO}$'s efficacy over existing methods, establishing it as a sample-efficient and cost-effective solution for LLM alignment.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes