Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics
This work targets the mismatch between behavioral and optimal policies in constrained RL, a problem relevant to both researchers and practitioners. It offers an algorithm whose iterates themselves converge to the optimal policy, though the contribution appears incremental because it builds on existing results for saddle-flow dynamics.
The paper tackles constrained reinforcement learning, in which an agent must maximize cumulative reward while meeting secondary constraints. It proposes an algorithm based on dissipative saddle flow dynamics that converges almost surely to the optimal policy, eliminating the mismatch between behavioral and optimal policies found in prior methods.
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying minimum requirements on secondary cumulative rewards. Several algorithms rooted in sample-based primal-dual methods have recently been proposed to solve this problem in policy space. However, such methods are based on stochastic gradient descent-ascent algorithms whose trajectories are connected to the optimal policy only after a mixing output stage that depends on the algorithm's history. As a result, there is a mismatch between the behavioral policy and the optimal one. In this work, we propose a novel algorithm for C-RL that does not suffer from these limitations. Leveraging recent results on regularized saddle-flow dynamics, we develop a stochastic gradient descent-ascent algorithm whose trajectories converge to the optimal policy almost surely.
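To make the contrast in the abstract concrete, the following minimal sketch compares plain gradient descent-ascent with a regularized saddle flow on a toy problem. It is not the paper's algorithm: the bilinear Lagrangian L(x, y) = x * y, the deterministic Euler discretization, the auxiliary variables z and w, the proximal-style regularizer (rho/2)(x - z)^2 - (rho/2)(y - w)^2, and all parameter values are assumptions chosen for illustration. The function names are hypothetical.

```python
import math


def plain_gda(x, y, step=0.05, iters=2000):
    """Euler-discretized gradient descent-ascent on the toy L(x, y) = x * y.

    Returns the distance of the last iterate from the saddle point (0, 0).
    """
    for _ in range(iters):
        # Simultaneous update: descend in x (dL/dx = y), ascend in y (dL/dy = x).
        x, y = x - step * y, y + step * x
    return math.hypot(x, y)


def regularized_saddle_flow(x, y, rho=1.0, step=0.05, iters=2000):
    """Euler-discretized saddle flow on an assumed regularized Lagrangian:

        L_reg = (rho / 2) * (x - z)**2 + x * y - (rho / 2) * (y - w)**2

    The primal variables (x, z) descend and the dual variables (y, w) ascend.
    z and w are auxiliary copies introduced only by the regularization.
    Returns the distance of the last (x, y) iterate from the saddle point.
    """
    z, w = x, y  # start the auxiliary variables at the current iterate
    for _ in range(iters):
        dx = -(rho * (x - z) + y)   # -dL_reg/dx
        dz = rho * (x - z)          # -dL_reg/dz
        dy = x - rho * (y - w)      # +dL_reg/dy
        dw = rho * (y - w)          # +dL_reg/dw
        x, z = x + step * dx, z + step * dz
        y, w = y + step * dy, w + step * dw
    return math.hypot(x, y)


if __name__ == "__main__":
    start = (1.0, 1.0)
    print("initial distance:    ", math.hypot(*start))
    print("plain GDA, last step:", plain_gda(*start))
    print("regularized, last:   ", regularized_saddle_flow(*start))
```

In this toy setting the last iterate of plain descent-ascent spirals away from the saddle point, so only an averaged or otherwise post-processed output could approach the solution, while the regularized flow dissipates energy through the terms in x - z and y - w and its last iterate shrinks toward the saddle point. This mirrors, under the stated assumptions, the distinction the abstract draws between methods that need an output stage and methods whose trajectories converge directly.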