Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
This work addresses a theoretical gap in RLHF by offering insight into efficient data utilization, although the contribution is incremental in that it builds on existing algorithms such as PC-PG.
The paper tackles the lack of theoretical justification for why Reinforcement Learning from Human Feedback (RLHF) works well with limited human feedback. It analyzes a policy optimization-based RLHF algorithm and provides performance bounds with low query complexity, suggesting that a small amount of feedback can suffice.
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provide insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. A key novelty is a trajectory-level elliptical potential analysis, which bounds the reward estimation error when comparison feedback (rather than numerical reward observation) is given. We provide and analyze the algorithms PG-RLHF and NN-PG-RLHF for two settings: linear and neural function approximation, respectively.
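To make the reward-inference step in the linear setting more concrete, the following is a minimal sketch, not the paper's PG-RLHF algorithm. It assumes a Bradley-Terry (logistic) preference model over trajectory-level feature sums, synthetic Gaussian features, and plain gradient descent; all function names and parameter values are hypothetical. It also evaluates the standard elliptical potential sum on the trajectory feature differences, which is the kind of quantity a trajectory-level elliptical potential argument would control. The exploration (policy cover) and policy-gradient components are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_linear_reward(diffs, labels, l2=1.0, lr=0.1, steps=500):
    """Regularized logistic MLE for a linear reward parameter (assumed model).

    diffs[i] is phi(tau_a) - phi(tau_b) for query i; labels[i] is 1.0 when
    trajectory a was preferred and 0.0 otherwise.
    """
    n, dim = diffs.shape
    theta = np.zeros(dim)
    for _ in range(steps):
        p = sigmoid(diffs @ theta)
        grad = diffs.T @ (p - labels) / n + l2 * theta / n
        theta -= lr * grad
    return theta

def elliptical_potential(diffs, lam=1.0):
    """Return sum_t min(1, ||x_t||^2 in the V_{t-1}^{-1} norm) and the final V."""
    dim = diffs.shape[1]
    V = lam * np.eye(dim)
    total = 0.0
    for x in diffs:
        total += min(1.0, float(x @ np.linalg.solve(V, x)))
        V += np.outer(x, x)
    return total, V

# Synthetic setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
dim, horizon, n_pairs, lam = 4, 5, 2000, 1.0
theta_true = rng.normal(size=dim)
theta_true /= np.linalg.norm(theta_true)

# A trajectory feature is the sum of its per-step features; each query
# compares two trajectories and only the preference bit is observed.
phi_a = rng.normal(size=(n_pairs, horizon, dim)).sum(axis=1)
phi_b = rng.normal(size=(n_pairs, horizon, dim)).sum(axis=1)
diffs = phi_a - phi_b
labels = (rng.random(n_pairs) < sigmoid(diffs @ theta_true)).astype(float)

theta_hat = fit_linear_reward(diffs, labels)
cosine = float(theta_hat @ theta_true / np.linalg.norm(theta_hat))

potential, V = elliptical_potential(diffs, lam)
log_det_ratio = float(np.linalg.slogdet(V)[1] - dim * np.log(lam))

print("cosine(theta_hat, theta_true):", cosine)
print("elliptical potential:", potential, "vs 2*log-det ratio:", 2 * log_det_ratio)
```

In this sketch the potential sum stays below twice the log-determinant ratio, which grows only logarithmically in the number of comparison queries. That logarithmic growth is the usual mechanism by which elliptical potential arguments yield low query complexity; the paper's trajectory-level analysis may differ in its details.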