LG, AI | Apr 3, 2025

Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards

arXiv:2504.03040v1 | 4.1 | h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses safety concerns in reinforcement learning for real-world applications. The contribution appears incremental: it builds on existing policy optimization frameworks and adds a new reward-modulation mechanism.

The paper tackles the problem of training reinforcement learning agents to maximize performance while adhering to safety constraints. It proposes Safety Modulated Policy Optimization (SMPO), which modulates rewards based on estimated safety costs. The reported experiments show SMPO outperforming classic and state-of-the-art methods in overall safe RL performance.

Safe Reinforcement Learning (Safe RL) aims to train an RL agent to maximize its performance in real-world environments while adhering to safety constraints, as exceeding safety violation limits can result in severe consequences. In this paper, we propose a novel safe RL approach called Safety Modulated Policy Optimization (SMPO), which enables safe policy function learning within the standard policy optimization framework through safety modulated rewards. In particular, we treat safety violation costs as feedback from the RL environment that is parallel to the standard rewards, and introduce a Q-cost function as a safety critic to estimate expected future cumulative costs. We then modulate the rewards using a cost-aware weighting function, which is designed to keep the policy within the safety limits based on the safety critic's estimate while maximizing the expected rewards. The policy function and the safety critic are learned simultaneously through gradient descent during online interactions with the environment. We conduct experiments in multiple RL environments, and the results demonstrate that our method outperforms several classic and state-of-the-art comparison methods in terms of overall safe RL performance.
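The abstract describes the mechanism only at a high level and does not give the form of the weighting function, so the following is a minimal Python sketch under stated assumptions, not the paper's implementation. It assumes a logistic gate as the cost-aware weight and a tabular TD(0) step for the safety critic; the names `cost_weight`, `modulate_reward`, `td_update_q_cost`, and the parameters `limit`, `sharpness`, `gamma`, and `lr` are all hypothetical.

```python
import math


def cost_weight(q_cost, limit, sharpness=5.0):
    """Hypothetical cost-aware weight in (0, 1).

    Assumption: a logistic gate that is close to 1 when the safety critic's
    estimate q_cost is below the safety limit and close to 0 when above it.
    The paper's actual weighting function may differ.
    """
    z = sharpness * (q_cost - limit)
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(z))


def modulate_reward(reward, q_cost, limit, sharpness=5.0):
    """Scale the environment reward by the cost-aware weight."""
    return cost_weight(q_cost, limit, sharpness) * reward


def td_update_q_cost(q_cost, cost, next_q_cost, gamma=0.99, lr=0.1):
    """One TD(0) step moving the Q-cost estimate toward cost + gamma * next.

    Assumption: a scalar (tabular) critic; a neural safety critic would
    minimize the squared version of the same TD error by gradient descent.
    """
    target = cost + gamma * next_q_cost
    return q_cost + lr * (target - q_cost)


if __name__ == "__main__":
    limit = 1.0
    for q in (0.0, 1.0, 2.0):
        print(q, modulate_reward(2.0, q, limit))
```

In this sketch the modulated reward would be passed to an ordinary policy optimizer in place of the raw reward, which is what lets the safety signal act inside the standard policy optimization loop. Note that a purely multiplicative gate also shrinks negative rewards, so a real design would need to handle reward sign deliberately.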
