LG · Oct 31, 2024

RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning

arXiv:2410.23569v4 · 7 citations · h-index: 7 · Has Code · NIPS
Originality: Incremental advance
AI Analysis

This work addresses safety-critical applications where ignoring risk can lead to harmful outcomes. It is an incremental advance, since it adapts existing risk-aware methods to the PbRL framework.

The paper tackles the lack of risk-aware objectives in Preference-based Reinforcement Learning (PbRL) for scenarios like AI safety and healthcare, introducing RA-PbRL to optimize nested and static quantile risk objectives with proven sublinear regret bounds and empirical validation.

Reinforcement Learning from Human Feedback (RLHF) has recently surged in popularity, particularly for aligning large language models and other AI systems with human intentions. At its core, RLHF can be viewed as a specialized instance of Preference-based Reinforcement Learning (PbRL), where the preferences specifically originate from human judgments rather than arbitrary evaluators. Despite this connection, most existing approaches in both RLHF and PbRL primarily focus on optimizing a mean reward objective, neglecting scenarios that necessitate risk-awareness, such as AI safety, healthcare, and autonomous driving. These scenarios often operate under a one-episode-reward setting, which makes conventional risk-sensitive objectives inapplicable. To address this, we explore and prove the applicability of two risk-aware objectives to PbRL: nested and static quantile risk objectives. We also introduce Risk-Aware PbRL (RA-PbRL), an algorithm designed to optimize both nested and static objectives. Additionally, we provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results to support our findings. Our code is available at https://github.com/aguilarjose11/PbRLNeurips.
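To make the contrast between a mean reward objective and a risk-aware one concrete, here is a minimal sketch, not the paper's algorithm. It assumes CVaR (the average of the worst fraction of outcomes) as one familiar quantile-based risk measure; the abstract does not say which measures RA-PbRL uses. The function names, the policies, and the return values are all hypothetical.

```python
import numpy as np


def mean_objective(returns):
    """Risk-neutral objective: the average episode return."""
    return float(np.mean(returns))


def static_cvar(returns, alpha):
    """Empirical CVaR at level alpha: mean of the worst ceil(alpha * n) returns.

    Assumption: CVaR stands in for a generic static quantile risk objective.
    """
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return float(sorted_returns[:k].mean())


# Two hypothetical policies with the same mean return but different tails.
safe_returns = [1.0] * 10
risky_returns = [2.0] * 9 + [-8.0]

for name, returns in [("safe", safe_returns), ("risky", risky_returns)]:
    print(name, mean_objective(returns), static_cvar(returns, alpha=0.2))
```

Both policies score the same under the mean objective, while the tail-focused objective separates them. This is the kind of distinction a risk-aware objective is meant to capture in safety-critical settings.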

Code Implementations: 1 repo