LG | Apr 8, 2025

SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL

arXiv:2504.06386v2 | 2 citations | h-index: 3 | IJCAI
Originality: Incremental advance
AI Analysis

This work addresses safety concerns for RL in safety-critical domains such as robotics and autonomous systems. It offers a method to certify policies during both training and deployment, though it builds on existing projection-based and scenario-based techniques.

The paper tackles the problem of providing safety guarantees for reinforcement learning in safety-critical applications. It introduces SPoRt, which bounds the probability that a task-specific policy violates a safety property and lets users trade off safety against performance. The trade-off is validated experimentally.

To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work, we present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound, based on a maximum policy ratio computed with respect to a 'safe' base policy, can also be applied to temporally-extended properties (beyond safety) and to robust control problems. To utilize these results, we introduce SPoRt, which provides a data-driven method for computing this bound for the base policy using the scenario approach, and includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. SPoRt thus enables users to trade off safety guarantees against task-specific performance. Complementing our theoretical results, we present experimental results demonstrating this trade-off and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
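To make the role of the policy ratio concrete, here is a minimal Python sketch under stated assumptions; it is not the paper's implementation, and the exact form of the paper's bound may differ. It assumes (a) the base policy was run for N i.i.d. episodes with zero observed violations, so a standard binomial argument gives a violation bound of 1 - beta^(1/N) at confidence 1 - beta, and (b) the task policy's per-step action probabilities are at most rho times the base policy's in every state. Under (b), a change-of-measure argument over a T-step episode bounds the task policy's violation probability by rho^T times the base policy's. All function and parameter names are hypothetical.

```python
import math


def base_violation_bound(num_episodes: int, beta: float) -> float:
    """Bound on the base policy's violation probability.

    Assumes zero violations were observed in `num_episodes` i.i.d. episodes.
    The returned eps satisfies (1 - eps) ** num_episodes == beta, so the true
    violation probability is at most eps with confidence 1 - beta.
    """
    if num_episodes <= 0:
        raise ValueError("num_episodes must be positive")
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return 1.0 - beta ** (1.0 / num_episodes)


def task_violation_bound(eps_base: float, max_ratio: float, horizon: int) -> float:
    """Scale the base bound by max_ratio ** horizon (change of measure).

    `max_ratio` is an assumed upper bound on pi_task(a|s) / pi_base(a|s);
    for two normalized distributions this maximum cannot be below 1.
    """
    if max_ratio < 1.0:
        raise ValueError("max_ratio must be at least 1")
    return min(1.0, eps_base * max_ratio ** horizon)


def max_ratio_for_target(eps_base: float, eps_target: float, horizon: int) -> float:
    """Largest per-step ratio that keeps the scaled bound at or below eps_target."""
    if eps_target < eps_base:
        raise ValueError("target bound cannot be tighter than the base bound")
    return (eps_target / eps_base) ** (1.0 / horizon)


if __name__ == "__main__":
    eps = base_violation_bound(num_episodes=1000, beta=1e-3)
    rho = max_ratio_for_target(eps, eps_target=0.05, horizon=20)
    print(f"base bound: {eps:.5f}, allowed per-step ratio: {rho:.4f}")
    print(f"task bound: {task_violation_bound(eps, rho, 20):.5f}")
```

The last helper illustrates the trade-off described in the abstract: a looser target bound allows a larger ratio, giving the task policy more room to deviate from the base policy, while a tighter target keeps it closer. A projection step during training, as in Projected PPO, would be one way to enforce such a ratio constraint.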

