LG Nov 8, 2021

Dueling RL: Reinforcement Learning with Trajectory Preferences

Aldo Pacchiano, Aadirupa Saha, Jonathan Lee

arXiv:2111.04850v328.4115 citations

Originality: Incremental advance

AI Analysis

This work addresses a challenge faced by RL practitioners who struggle to design accurate reward functions. It offers a formal framework with theoretical guarantees, though the advance is incremental: it extends preference-based methods to non-Markovian settings.

The paper tackles the problem of preference-based reinforcement learning (PbRL) by learning from binary trajectory preferences instead of hand-crafted reward functions, achieving near-optimal regret guarantees of $\tilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$ with known transitions and $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH})$ with unknown transitions.

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1-bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function, often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we then propose an algorithm with an almost optimal regret guarantee of $\tilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend the above algorithm to the case of unknown transition dynamics and provide an algorithm with a near-optimal regret guarantee of $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH})$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference-based RL problems with trajectory preferences.
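The 1-bit trajectory feedback described in the abstract can be illustrated with a minimal sketch. It assumes a logistic link for the generalized linear model and takes each trajectory as an already-computed $d$-dimensional embedding. The abstract specifies neither choice, and the names `preference_probability`, `sample_preference`, `w`, `phi_1`, and `phi_2` are hypothetical, not taken from the paper.

```python
import math
import random


def preference_probability(w, phi_1, phi_2):
    """Probability that trajectory 1 is preferred over trajectory 2.

    Assumption: a logistic link applied to w . (phi_1 - phi_2), where
    phi_1 and phi_2 are hypothetical d-dimensional trajectory embeddings.
    """
    score = sum(w_i * (a - b) for w_i, a, b in zip(w, phi_1, phi_2))
    return 1.0 / (1.0 + math.exp(-score))


def sample_preference(w, phi_1, phi_2, rng):
    """Draw the 1-bit feedback: 1 if trajectory 1 is preferred, else 0."""
    return 1 if rng.random() < preference_probability(w, phi_1, phi_2) else 0


# Illustrative values only (d = 3); none of these numbers come from the paper.
w = [1.0, -0.5, 0.25]
phi_a = [0.9, 0.1, 0.4]
phi_b = [0.2, 0.3, 0.4]

rng = random.Random(0)
p_ab = preference_probability(w, phi_a, phi_b)
bit = sample_preference(w, phi_a, phi_b, rng)
```

The learner in this sketch never observes a numeric reward, only `bit`. Any algorithm in this setting would have to estimate `w` from such comparisons.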

