Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets
This work aims to improve safety and efficiency in constrained reinforcement learning for safety-critical applications. It is an incremental advance that offers a novel method for a known bottleneck.
The paper tackles the challenge of balancing task performance and constraint satisfaction in constrained reinforcement learning, where existing methods often get stuck in over-conservative or constraint-violating local minima. It proposes Adversarial Constrained Policy Optimization (ACPO), which adapts cost budgets during training and achieves better performance than baselines on Safety Gymnasium and quadruped locomotion tasks.
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods struggle to strike the right balance between task performance and constraint satisfaction, and they are prone to getting stuck in over-conservative or constraint-violating local minima. In this paper, we propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of reward and adaptation of cost budgets during training. Our approach divides the original constrained problem into two adversarial stages that are solved alternately, and the policy update performance of our algorithm is theoretically guaranteed. We validate our method through experiments on Safety Gymnasium and quadruped locomotion tasks. Results demonstrate that our algorithm achieves better performance than commonly used baselines.
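For orientation, the constrained problem and the alternating scheme can be sketched in standard constrained-MDP notation. The abstract does not spell out either formulation, so everything below is an assumption made for illustration: the symbols $J_R$ (expected return), $J_C$ (expected cost), $d$ (cost budget), $d_k$ and $b_k$ (cost and reward budgets at iteration $k$), and in particular the form of the two stages, which is only one plausible reading of "two adversarial stages that are solved alternately".

```latex
% Standard constrained RL problem (assumed notation, not taken from the abstract):
% maximize expected return subject to a fixed cost budget d.
\[
  \max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d
\]

% One plausible reading of the two alternating stages (an assumption):
% Stage A: improve reward under the current cost budget d_k.
\[
  \max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d_k
\]
% Stage B: reduce cost while keeping reward above the current reward budget b_k.
\[
  \min_{\pi} \; J_C(\pi) \quad \text{s.t.} \quad J_R(\pi) \ge b_k
\]
```

Under this reading, the budgets $d_k$ and $b_k$ would be updated between stages, so that the cost budget adapts during training instead of staying fixed at $d$ from the start.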