LG · AI · RO · Jul 1, 2025

Residual Reward Models for Preference-based Reinforcement Learning

arXiv:2507.00611v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the problem of slow convergence in PbRL for robotics and simulation tasks, offering an incremental improvement by effectively leveraging prior knowledge to enhance training efficiency.

The paper tackles slow convergence in Preference-based Reinforcement Learning (PbRL) by proposing a Residual Reward Model (RRM) that splits the true reward into a prior reward and a learned reward, trained with preferences. Experimental results on Meta-World tasks and a real robot show that RRM substantially improves performance and accelerates policy learning, achieving success in fewer steps than baselines.

Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can converge slowly because it must also train a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences, but when the model is a neural network, using different loss functions for pre-training and fine-tuning can make reliable optimization difficult. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example a user's "best guess" reward function or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. The improvements hold across several types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot: it significantly accelerates policy learning across different tasks, achieving success in fewer steps than the baseline. Videos are available at https://sunlighted.github.io/RRM-web/.
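The decomposition described in the abstract (true reward = prior reward + learned reward, with only the learned part trained from preferences) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the distance-based prior, the linear residual standing in for a neural network, and the Bradley-Terry preference loss, which is common in PbRL but is not specified in the abstract.

```python
import numpy as np

def prior_reward(states):
    """Hypothetical fixed 'best guess' reward: negative distance to the origin."""
    return -np.linalg.norm(states, axis=-1)

def learned_reward(states, w):
    """Hypothetical trainable residual; linear in the state for brevity."""
    return states @ w

def residual_reward(states, w):
    """Total reward = fixed prior term + trainable residual term."""
    return prior_reward(states) + learned_reward(states, w)

def preference_loss(w, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy for one pair of trajectory segments.

    label is 1.0 if segment A is preferred and 0.0 if segment B is preferred.
    """
    # P(A preferred) = sigmoid(return_a - return_b)
    logit = residual_reward(seg_a, w).sum() - residual_reward(seg_b, w).sum()
    # Numerically stable binary cross-entropy expressed on the logit.
    return np.logaddexp(0.0, logit) - label * logit

def preference_grad(w, seg_a, seg_b, label):
    """Gradient of the loss w.r.t. w; the fixed prior adds no gradient terms."""
    logit = residual_reward(seg_a, w).sum() - residual_reward(seg_b, w).sum()
    p_a = 1.0 / (1.0 + np.exp(-logit))
    return (p_a - label) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))

# Toy usage: two random 10-step segments of 3-D states, A labelled as preferred.
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(10, 3))
seg_b = rng.normal(size=(10, 3))
w = np.zeros(3)  # with a zero residual, the total reward equals the prior
loss_before = preference_loss(w, seg_a, seg_b, 1.0)
for _ in range(100):
    w -= 0.05 * preference_grad(w, seg_a, seg_b, 1.0)
loss_after = preference_loss(w, seg_a, seg_b, 1.0)
```

In this sketch the policy would be trained on `residual_reward`, so it has a usable signal from the prior before any preferences are collected, and the preference loss only has to correct the prior where it disagrees with the labels.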

