ML · LG · Oct 26, 2023

Demonstration-Regularized RL

arXiv:2310.17303v2 · 8 citations · h-index: 43
Originality: Incremental advance
AI Analysis

The paper addresses sample inefficiency in RL, offering researchers and practitioners theoretical guarantees for demonstration-regularized methods. The advance is incremental: it builds on existing KL-regularization approaches.

This paper studies how expert demonstrations improve sample efficiency in reinforcement learning (RL). It proves sample complexity bounds showing that $N^{\mathrm{E}}$ expert demonstrations reduce the cost of identifying an optimal policy to order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite MDPs and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear MDPs, and it extends the analysis to RL from human feedback (RLHF).
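As a rough reading of the finite-MDP bound, the scaling can be worked out directly. The notation $n(\varepsilon, N^{\mathrm{E}})$ is hypothetical, introduced here only for illustration, and the bound is treated as if it were tight up to constants and logarithmic factors, which is an assumption rather than something the summary states:

$$
n(\varepsilon, N^{\mathrm{E}}) \approx \frac{\mathrm{Poly}(S,A,H)}{\varepsilon^2 N^{\mathrm{E}}}
\quad\Longrightarrow\quad
\frac{n(\varepsilon, 2N^{\mathrm{E}})}{n(\varepsilon, N^{\mathrm{E}})} \approx \frac{1}{2},
\qquad
\frac{n(\varepsilon/2, N^{\mathrm{E}})}{n(\varepsilon, N^{\mathrm{E}})} \approx 4 .
$$

Under this reading, doubling the number of demonstrations halves the bound on environment samples, while halving the target precision $\varepsilon$ quadruples it.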

Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages expert demonstrations through KL-regularization toward a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite Markov decision processes (MDPs) and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear MDPs, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case, and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, which sets our approach apart from prior work.
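The two ingredients named in the abstract, behavior cloning and KL-regularization toward the cloned policy, can be sketched in a small tabular setting. This is a minimal illustration and not the paper's algorithm: it assumes the transition model and rewards are known (so it plans rather than learns), uses Laplace-smoothed counts for behavior cloning, and all function names, the toy MDP, and the demonstrations are hypothetical. It relies on the standard closed form for the one-step problem $\max_\pi \langle \pi, q \rangle - \lambda\,\mathrm{KL}(\pi \,\|\, \pi^{\mathrm{BC}})$, whose maximizer is $\pi(a) \propto \pi^{\mathrm{BC}}(a)\exp(q(a)/\lambda)$.

```python
import math


def behavior_cloning(demos, n_states, n_actions, horizon, smoothing=1.0):
    """Tabular behavior cloning: smoothed action frequencies per (step, state).

    demos is a list of trajectories, each a list of (state, action) pairs of
    length `horizon`. Laplace smoothing (an assumption of this sketch) keeps
    every probability positive so that KL(pi || pi_bc) stays finite.
    """
    counts = [[[smoothing] * n_actions for _ in range(n_states)]
              for _ in range(horizon)]
    for traj in demos:
        for h, (s, a) in enumerate(traj):
            counts[h][s][a] += 1.0
    return [[[c / sum(row) for c in row] for row in step] for step in counts]


def kl_regularized_planning(P, R, pi_bc, lam):
    """Backward induction for max_pi E[sum_h r_h - lam * KL(pi_h || pi_bc_h)].

    P[s][a] is a list of next-state probabilities and R[s][a] a reward.
    The model is assumed known here; the paper's setting learns it from samples.
    Returns the regularized-optimal policy and the step-0 values.
    """
    horizon, n_states, n_actions = len(pi_bc), len(P), len(P[0])
    V = [0.0] * n_states
    policy = [None] * horizon
    for h in reversed(range(horizon)):
        new_V, step_policy = [0.0] * n_states, []
        for s in range(n_states):
            q = [R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
                 for a in range(n_actions)]
            m = max(q)  # subtract the max before exponentiating for stability
            # Closed form: pi(a) is proportional to pi_bc(a) * exp(q(a) / lam).
            w = [pi_bc[h][s][a] * math.exp((q[a] - m) / lam)
                 for a in range(n_actions)]
            z = sum(w)
            step_policy.append([x / z for x in w])
            # Regularized value: lam * log sum_a pi_bc(a) * exp(q(a) / lam).
            new_V[s] = m + lam * math.log(z)
        policy[h], V = step_policy, new_V
    return policy, V


# Hypothetical toy MDP: action a moves deterministically to state a,
# and only action 1 taken in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [0.0, 1.0]]
# Five identical expert trajectories starting in state 0.
demos = [[(0, 1), (1, 1), (1, 1)] for _ in range(5)]

pi_bc = behavior_cloning(demos, n_states=2, n_actions=2, horizon=3)
policy, values = kl_regularized_planning(P, R, pi_bc, lam=0.1)
print("BC policy at step 0, state 0:", pi_bc[0][0])
print("Regularized policy at step 0, state 0:", policy[0][0])
print("Regularized value from state 0:", values[0])
```

The regularization weight `lam` controls the trade-off: a small value lets the planner move toward the reward-greedy action, while a large value keeps the policy close to the cloned one.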

