LG · Jan 26, 2024

Off-Policy Primal-Dual Safe Reinforcement Learning

Zifan Wu, Bo Tang, Qian Lin, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

arXiv:2401.14758v213.414 citations · h-index: 51 · Has Code · ICLR

Originality: Incremental advance

AI Analysis

This work addresses safety constraint violations in off-policy safe RL, a problem that matters for real-world applications such as robotics and autonomous systems. The advance is incremental, since it builds on existing primal-dual methods.

The paper tackles cumulative cost estimation errors in off-policy primal-dual safe reinforcement learning, which can lead to safety constraint violations. It proposes conservative policy optimization to improve constraint satisfaction, and local policy convexification to reduce the suboptimality that conservatism introduces. The resulting method achieves asymptotic performance comparable to state-of-the-art on-policy methods with fewer samples, and it significantly reduces constraint violations during training.

Primal-dual safe RL methods commonly iterate between the primal update of the policy and the dual update of the Lagrange multiplier. Such a training paradigm is highly susceptible to error in cumulative cost estimation, since this estimate is the key bond connecting the primal and dual update processes. We show that this problem causes significant underestimation of cost when using off-policy methods, leading to failure to satisfy the safety constraint. To address this issue, we propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in cost estimation. This improves constraint satisfaction but also potentially hinders reward maximization. We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty. We provide theoretical interpretations of the joint coupling effect of these two ingredients and further verify them by extensive experiments. Results on benchmark tasks show that our method not only achieves an asymptotic performance comparable to state-of-the-art on-policy methods while using far fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL.
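To make the role of the cost estimate concrete, here is a minimal sketch of a dual update driven by a conservative cost estimate. It is not the authors' implementation: the function names, the step size `lr`, the conservatism weight `k`, and the choice of "ensemble mean plus `k` standard deviations" as the uncertainty-aware estimate are all assumptions made for illustration. The abstract only states that uncertainty in cost estimation is taken into account.

```python
import statistics


def conservative_cost(cost_estimates, k=1.0):
    """Hypothetical uncertainty-aware estimate: ensemble mean + k * std.

    `cost_estimates` stands in for the outputs of several cost critics;
    this particular form is an assumption, not taken from the paper.
    """
    mean = statistics.fmean(cost_estimates)
    std = statistics.pstdev(cost_estimates)
    return mean + k * std


def dual_update(lam, cost_estimate, cost_limit, lr=0.1):
    """Projected gradient ascent on the Lagrange multiplier (kept >= 0)."""
    return max(0.0, lam + lr * (cost_estimate - cost_limit))


def lagrangian_objective(reward_value, cost_estimate, lam, cost_limit):
    """Primal objective to maximise: reward minus the weighted violation."""
    return reward_value - lam * (cost_estimate - cost_limit)


# Illustrative numbers only: three critics disagree about the cumulative cost.
estimates = [20.0, 24.0, 28.0]
limit = 25.0
lam = 0.5

# The plain mean sits below the limit, so the multiplier shrinks.
lam_plain = dual_update(lam, statistics.fmean(estimates), limit)
# The conservative estimate sits above the limit, so the multiplier grows.
lam_conservative = dual_update(lam, conservative_cost(estimates), limit)
```

With these made-up numbers, the mean estimate suggests the constraint is satisfied and relaxes the multiplier, while the conservative estimate tightens it. This is the kind of coupling between cost estimation and the dual update that the abstract describes: an underestimated cost weakens the penalty exactly when it is needed.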

