Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
This work addresses a gap in bilevel optimization for RL problems with dynamic objectives and is relevant to machine learning researchers and practitioners, though its contribution appears incremental: it extends existing methods to new domains.
The paper tackles the challenge of applying bilevel optimization to dynamic objective functions in reinforcement learning tasks, such as RLHF and incentive design, by introducing a principled penalty-based algorithmic framework and demonstrating its effectiveness through simulations.
Bilevel optimization has recently been applied to many machine learning tasks. However, its applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. In contrast, bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled with dynamic objective functions that go beyond simple static objective structures, which poses significant challenges for existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback, and incentive design.
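To make the idea of a penalty formulation concrete, the sketch below applies a value-function penalty to a small static bilevel problem. It is only an illustration under stated assumptions, not the paper's algorithm: the quadratic objectives `f` and `g`, the penalty weight `lam`, and the function names are all hypothetical, and the example contains no RL dynamics or policy gradients. The lower-level constraint "y minimizes g(x, y)" is replaced by the penalty term lam * (g(x, y) - v(x)), where v(x) = min_y g(x, y), and the penalized objective is minimized jointly over (x, y) by plain gradient descent.

```python
# Toy static bilevel problem (an illustrative assumption, not from the paper):
#   upper level: f(x, y) = (x - 1)^2 + (y + 1)^2
#   lower level: g(x, y) = (y - x)^2, so y*(x) = x and v(x) = min_y g(x, y) = 0
# Penalized single-level objective:
#   F(x, y) = f(x, y) + lam * (g(x, y) - v(x))


def penalized_grad(x, y, lam):
    """Gradient of F with respect to (x, y); v(x) is constant here, so it drops out."""
    dx = 2.0 * (x - 1.0) - 2.0 * lam * (y - x)
    dy = 2.0 * (y + 1.0) + 2.0 * lam * (y - x)
    return dx, dy


def penalty_gradient_descent(lam=100.0, steps=5000, x0=2.0, y0=2.0):
    """Run gradient descent on the penalized objective F and return (x, y)."""
    # The Hessian of F has eigenvalues 2 and 2 + 4 * lam, so this step size is stable.
    step = 1.0 / (2.0 + 4.0 * lam)
    x, y = x0, y0
    for _ in range(steps):
        dx, dy = penalized_grad(x, y, lam)
        x, y = x - step * dx, y - step * dy
    return x, y


if __name__ == "__main__":
    for lam in (1.0, 10.0, 100.0):
        x, y = penalty_gradient_descent(lam=lam)
        print(f"lam={lam:6.1f}  x={x:.6f}  y={y:.6f}")
```

For this toy problem the bilevel solution is x = y = 0, while the penalized minimizer is x = 1 / (1 + 2 * lam), y = -x, so the gap to the bilevel solution shrinks as `lam` grows. The larger `lam` also forces a smaller step size, which hints at the trade-off any penalty-based method has to manage.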