LG AIDec 15, 2024

Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation

Juntao Dai, Yaodong Yang, Qian Zheng, Gang Pan

arXiv:2412.11138v16.43 citationsh-index: 16ICML

Originality Highly original

AI Analysis

This addresses safety-critical applications in robotics and autonomous systems by providing a novel estimation method for finite-horizon constraints, though it is incremental as it builds on existing Safe RL frameworks.

The paper tackled the problem of safety violations in finite-horizon Safe Reinforcement Learning by proposing a Gradient-based Estimation (GBE) method, which effectively estimates constraint changes and led to the development of the CGPO algorithm that ensures safe policy updates.

A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the constraint condition for the next policy, which is crucial for guiding the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-discounted constraints, resulting in safety-violation updates. In response, we propose the first estimation method for finite-horizon non-discounted constraints in deep Safe RL, termed Gradient-based Estimation (GBE), which relies on the analytic gradient derived along trajectories. Our theoretical and empirical analyses demonstrate that GBE can effectively estimate constraint changes over a finite horizon. Constructing a surrogate optimization problem with GBE, we developed a novel Safe RL algorithm called Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible optimal policies by iteratively resolving sub-problems within trust regions. Our empirical results reveal that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.

View on arXiv PDF

Similar