RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences
This work addresses the sensitivity of PbRL to noisy preference feedback in robotics applications, offering an incremental improvement over existing methods.
The paper addresses the sensitivity of preference-based reinforcement learning (PbRL) to noisy human preferences. It introduces RIME, a robust algorithm that combines a sample selection-based discriminator with a warm start for the reward model, and reports significantly improved robustness over the state-of-the-art PbRL method on robotic manipulation and locomotion tasks.
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.
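The idea of filtering noisy preferences by sample selection can be illustrated with a minimal sketch. This is not the RIME implementation: it assumes a Bradley-Terry preference model over predicted segment returns, a cross-entropy loss per labeled pair, and a fixed loss threshold, whereas the abstract describes the filtering as dynamic without specifying the rule. The names `preference_prob`, `cross_entropy`, `select_trusted`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math

def preference_prob(ret_a, ret_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    given the reward model's predicted returns for each segment (assumed model)."""
    return 1.0 / (1.0 + math.exp(ret_b - ret_a))

def cross_entropy(p_a, label):
    """Cross-entropy between the predicted preference and a binary label
    (label 1 means A was preferred, label 0 means B was preferred)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps) + (1 - label) * math.log(1.0 - p_a + eps))

def select_trusted(samples, threshold):
    """Split labeled pairs into trusted and suspect sets.
    A pair is trusted when its loss is below the threshold, i.e. the label
    agrees with what the current reward model predicts. The fixed threshold
    is an assumption made for this sketch."""
    trusted, suspect = [], []
    for ret_a, ret_b, label in samples:
        loss = cross_entropy(preference_prob(ret_a, ret_b), label)
        if loss < threshold:
            trusted.append((ret_a, ret_b, label))
        else:
            suspect.append((ret_a, ret_b, label))
    return trusted, suspect

# Hypothetical data: (predicted return of A, predicted return of B, label).
samples = [
    (5.0, 1.0, 1),  # label agrees with the reward model
    (0.5, 4.0, 0),  # label agrees with the reward model
    (6.0, 0.0, 0),  # label contradicts the reward model, likely noisy
]
trusted, suspect = select_trusted(samples, threshold=1.0)
print(len(trusted), len(suspect))
```

In this sketch only the trusted pairs would be used to update the reward model. Because the selection depends on the reward model's own predictions, an inaccurate model can reject correct labels, which is the kind of cumulative selection error the warm start is meant to counteract.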