LG · Jan 31, 2025

Best Policy Learning from Trajectory Preference Feedback

arXiv:2501.18873v3 · 1 citation · h-index: 4
Originality: Highly original
AI Analysis

The paper addresses robust policy optimization for generative AI, particularly post-training alignment, though its contribution is incremental in that it builds on existing PbRL methods.

The paper tackles best policy identification in preference-based reinforcement learning (PbRL) for aligning generative models. It proposes a new algorithm, PSPL, which the authors present as providing the first Bayesian simple regret guarantees for PbRL, along with an efficient approximation reported to outperform existing baselines on simulation and image generation benchmarks.

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset--potentially biased or out-of-distribution and collected from a rater of subpar 'competence'--with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.
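
To make the top-two sampling idea concrete, here is a minimal sketch of a top-two posterior sampling loop driven by noisy binary comparisons. It is not the paper's PSPL: it assumes a small set of arms standing in for policies, a finite set of candidate reward vectors with an exact Bayesian update, and a simulated rater following a Bradley-Terry model. It omits the offline dataset, the rater competence model, and the posterior over dynamics. All names (`top_two_preference_loop`, `hypotheses`, `max_resample`) are hypothetical and chosen for illustration only.

```python
import math
import random
from itertools import permutations


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def sample_index(weights, rng):
    # Draw an index with probability proportional to its weight.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def top_two_preference_loop(true_reward, hypotheses, rounds=300,
                            max_resample=50, seed=0):
    """Top-two posterior sampling with pairwise preference feedback.

    Assumption: the unknown reward vector is one of `hypotheses`, so the
    posterior is a weight vector over that finite set.
    """
    rng = random.Random(seed)
    n_arms = len(true_reward)
    posterior = [1.0 / len(hypotheses)] * len(hypotheses)

    for _round in range(rounds):
        # Leader: the best arm under one posterior sample.
        first = argmax(hypotheses[sample_index(posterior, rng)])

        # Challenger: resample until the sampled best arm differs.
        second = None
        for _attempt in range(max_resample):
            candidate = argmax(hypotheses[sample_index(posterior, rng)])
            if candidate != first:
                second = candidate
                break
        if second is None:
            # Posterior has concentrated; fall back to a random other arm.
            second = rng.choice([a for a in range(n_arms) if a != first])

        # Simulated rater: Bradley-Terry comparison on the true rewards.
        p_first = sigmoid(true_reward[first] - true_reward[second])
        first_wins = rng.random() < p_first

        # Exact Bayes update of the weight on every candidate reward vector.
        for h, reward in enumerate(hypotheses):
            lik = sigmoid(reward[first] - reward[second])
            posterior[h] *= lik if first_wins else 1.0 - lik
        total = sum(posterior)
        posterior = [w / total for w in posterior]

    # Recommend the arm with the highest posterior-mean reward.
    mean_reward = [
        sum(w * reward[a] for w, reward in zip(posterior, hypotheses))
        for a in range(n_arms)
    ]
    return argmax(mean_reward), posterior


# Toy problem: three arms, candidates are all orderings of the same rewards.
hypotheses = [list(p) for p in permutations([0.0, 1.0, 3.0])]
true_reward = [0.0, 3.0, 1.0]
best_arm, posterior = top_two_preference_loop(true_reward, hypotheses)
print("recommended arm:", best_arm)
```

The loop only ever queries comparisons between a sampled leader and a sampled challenger, which is the pure-exploration flavor the abstract describes: the objective is the quality of the final recommendation rather than the reward collected along the way.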

