LG AI

Jun 9, 2022

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu

Tsinghua

arXiv:2206.04436v222.169 citations

h-index: 36

Has Code

Originality: Highly original

AI Analysis

This work addresses safety issues in reinforcement learning for applications like robotics, though it is incremental as it builds on existing risk-sensitive methods.

The paper tackles catastrophic failures in deep reinforcement learning caused by uncertainty in both transitions and observations. It proposes CPPO, an algorithm that constrains conditional value-at-risk (CVaR) to limit risk and improve robustness. In the reported experiments on MuJoCo tasks, CPPO achieves higher cumulative reward and greater robustness against both types of disturbance.

Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most existing methods for safe reinforcement learning can handle only transition disturbance or only observation disturbance, since these two kinds of disturbance affect different parts of the agent. In addition, optimizing the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric, the Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on this analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes a risk-sensitive constrained optimization problem by keeping the policy's CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
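To make the two quantities in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that risk is measured on the loss (the negative of the episode return), that CVaR at level `alpha` is estimated as the mean of the worst `1 - alpha` fraction of sampled losses, and that VFR is estimated from a finite set of state-value estimates. All function names and the example numbers are hypothetical.

```python
import math


def empirical_cvar(losses, alpha):
    """Estimate CVaR at level alpha as the mean of the largest
    (1 - alpha) fraction of the sampled losses (assumed estimator)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)  # worst (largest) losses first
    # Round before ceil so float noise such as 3.0000000000000004
    # does not add an extra sample to the tail.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


def satisfies_cvar_constraint(returns, alpha, threshold):
    """Check the constraint 'CVaR of the loss stays under a threshold',
    taking loss = -return (an assumed sign convention)."""
    losses = [-r for r in returns]
    return empirical_cvar(losses, alpha) <= threshold


def value_function_range(values):
    """Gap between the best and worst state-value estimates."""
    if not values:
        raise ValueError("values must be non-empty")
    return max(values) - min(values)


# Hypothetical episode returns: mostly good, with two bad episodes.
episode_returns = [10, 8, 9, -5, 7, 11, 9, 10, -1, 8]
tail_risk = empirical_cvar([-r for r in episode_returns], alpha=0.8)
vfr = value_function_range([1.0, 4.5, -2.0])
```

With `alpha=0.8` and ten episodes, the estimate averages only the two worst episodes, so it reacts to rare failures that a plain mean return would hide, while being less pessimistic than the single worst case. In a constrained policy update, a check such as `satisfies_cvar_constraint` would typically be replaced by a differentiable penalty or a Lagrange multiplier term.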
