LG May 23, 2022

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang

arXiv:2205.11140v2 29.394 citations h-index: 48

Originality Highly original

AI Analysis

This work addresses the challenge of making reinforcement learning more practical and interpretable for real-world applications by enabling agents to learn from human preferences. It is incremental in that it builds on existing PbRL methods, but it extends them to general function approximation with theoretical guarantees.

The paper tackles the problem of preference-based reinforcement learning (PbRL) with general function approximation, where an agent learns from human preferences over trajectories instead of numeric rewards, and achieves a near-optimal regret bound of $\tilde{O}(\operatorname{poly}(d H) \sqrt{K})$ using an optimistic model-based algorithm. It also extends PbRL to $n$-wise comparisons with a sample-efficient algorithm, providing the first theoretical result for PbRL beyond tabular cases.

We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
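To make the feedback setting concrete, the following is a minimal sketch of an overseer who compares two trajectories and returns only a binary preference. It is not the paper's algorithm: the paper allows a general preference model class, whereas this sketch assumes a logistic (Bradley-Terry style) link on the difference in hidden returns. The function names `trajectory_return`, `preference_probability`, and `sample_preference`, and the example reward values, are hypothetical and chosen for illustration only.

```python
import math
import random


def trajectory_return(rewards):
    """Sum of per-step rewards along one trajectory (hidden from the agent)."""
    return sum(rewards)


def preference_probability(rewards_a, rewards_b):
    """P(overseer prefers A over B) under an ASSUMED logistic link.

    The logistic link is an illustrative assumption, not taken from the paper.
    """
    gap = trajectory_return(rewards_a) - trajectory_return(rewards_b)
    return 1.0 / (1.0 + math.exp(-gap))


def sample_preference(rewards_a, rewards_b, rng):
    """Return 1 if A is preferred, else 0: the only signal the agent observes."""
    return 1 if rng.random() < preference_probability(rewards_a, rewards_b) else 0


# Hypothetical example: two length-3 trajectories with hidden per-step rewards.
rng = random.Random(0)
traj_a = [1.0, 0.5, 0.0]
traj_b = [0.0, 0.5, 0.0]

p = preference_probability(traj_a, traj_b)
feedback = [sample_preference(traj_a, traj_b, rng) for _ in range(1000)]
print(f"P(A preferred) = {p:.3f}, empirical rate = {sum(feedback) / len(feedback):.3f}")
```

In this sketch the agent would see only the entries of `feedback`, never the reward lists, which is why the preference model has to be estimated alongside the transition model.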
