LG · Jan 31, 2025

Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

arXiv:2501.18873v37.11 citations · h-index: 4

Originality Highly original

AI Analysis

This paper addresses the challenge of robust policy optimization in generative AI, particularly for post-training alignment, though the contribution is incremental in that it builds on existing PbRL methods.

The paper tackles the problem of identifying the best policy in preference-based reinforcement learning (PbRL) for aligning generative models, proposing a novel algorithm called PSPL that achieves the first Bayesian simple regret guarantees and outperforms existing baselines on benchmarks.

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset--potentially biased or out-of-distribution and collected from a rater of subpar 'competence'--with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.
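The top-two posterior sampling idea mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's $\mathsf{PSPL}$ algorithm: it assumes a small finite set of candidate policies ("arms") with scalar utilities, a finite hypothesis class over those utilities, and a simulated rater following a Bradley-Terry (logistic) preference model. It ignores dynamics, offline data, and trajectory structure. All names here (`preference_prob`, `top_two_posterior_sampling`, `beta`) are hypothetical and chosen for illustration only.

```python
import itertools
import math
import random


def preference_prob(u_a, u_b, beta=2.0):
    """Bradley-Terry probability that the rater prefers a over b.

    `beta` is an assumed stand-in for rater reliability (higher = less noisy).
    """
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


def best_arm(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def top_two_posterior_sampling(true_utils, hypotheses, rounds=200, seed=0,
                               max_resample=50):
    """Toy best-arm identification from noisy pairwise preferences."""
    rng = random.Random(seed)
    log_post = [0.0] * len(hypotheses)  # uniform prior, kept in log space

    def sample_hypothesis():
        # Draw one utility hypothesis from the current posterior.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        return rng.choices(hypotheses, weights=weights, k=1)[0]

    for _ in range(rounds):
        # Leader: best arm under one posterior sample.
        first = best_arm(sample_hypothesis())
        # Challenger: resample until a different best arm appears.
        second = first
        for _ in range(max_resample):
            second = best_arm(sample_hypothesis())
            if second != first:
                break
        if second == first:
            # Posterior is concentrated; fall back to a random challenger.
            second = rng.choice(
                [a for a in range(len(true_utils)) if a != first])

        # Simulated rater returns a noisy binary preference.
        p_true = preference_prob(true_utils[first], true_utils[second])
        prefers_first = rng.random() < p_true

        # Exact Bayesian update over the finite hypothesis class.
        for h, utils in enumerate(hypotheses):
            p = preference_prob(utils[first], utils[second])
            log_post[h] += math.log(p if prefers_first else 1.0 - p)

    # Recommend the arm with the highest posterior probability of being best.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    mass = [0.0] * len(true_utils)
    for w, utils in zip(weights, hypotheses):
        mass[best_arm(utils)] += w / total
    return best_arm(mass), mass


if __name__ == "__main__":
    # Hypothesis class: every assignment of utilities (0, 1, 2) to three arms.
    hyps = list(itertools.permutations((0.0, 1.0, 2.0)))
    arm, mass = top_two_posterior_sampling((0.0, 2.0, 1.0), hyps)
    print("recommended arm:", arm)
    print("posterior mass of each arm being best:", mass)
```

Querying the leader against a resampled challenger concentrates comparisons on the arms that are still plausibly optimal, which is the intuition behind top-two sampling for pure exploration.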
