Optimal Transport-Guided Safety in Temporal Difference Reinforcement Learning
This work addresses safety in reinforcement learning for decision-making agents and offers an approach to reduce unsafe behavior in stochastic environments, though its contribution appears incremental, since it integrates existing theories.
The paper tackles safety in reinforcement learning by introducing a temporal difference algorithm that uses optimal transport theory to quantify the uncertainty of actions, which encourages the agent toward safer behavior. It proves theoretically that the algorithm reduces the probability of visiting unsafe states, and its case studies show safer behavior with performance maintained.
The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance, frequently without considering safety. In contrast, safe reinforcement learning seeks to reduce or avoid unsafe behavior. This paper views safety as taking actions with more predictable consequences under environment stochasticity and introduces a temporal difference algorithm that uses optimal transport theory to quantify the uncertainty associated with actions. By integrating this uncertainty score into the decision-making objective, the agent is encouraged to favor actions with more predictable outcomes. We theoretically prove that our algorithm reduces the probability of visiting unsafe states. We evaluate the proposed algorithm on several case studies in the presence of various forms of environment uncertainty. The results demonstrate that our method not only provides safer behavior but also maintains performance. A Python implementation of our algorithm is available at \href{https://github.com/SAILRIT/Risk-averse-TD-Learning}{https://github.com/SAILRIT/OT-guided-TD-Learning}.
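The idea of folding an uncertainty score into the decision-making objective can be illustrated with a minimal tabular sketch. It is not the algorithm from the paper or from the linked repository. It assumes, purely for illustration, that the uncertainty of an action is the 1-D Wasserstein-1 distance between its observed outcomes and a point mass at their mean, and that the score enters the objective as a linear penalty with a hypothetical weight `beta`. All function names below are likewise hypothetical.

```python
import numpy as np


def wasserstein_1d(x, y):
    """W1 distance between two equal-size 1-D empirical samples.

    For equal sample sizes, W1 equals the mean absolute difference
    between the sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))


def uncertainty_score(outcomes):
    """Assumed score: W1 distance from the observed outcomes to a point
    mass at their mean (zero for a perfectly predictable action)."""
    outcomes = np.asarray(outcomes, dtype=float)
    return wasserstein_1d(outcomes, np.full_like(outcomes, outcomes.mean()))


def select_action(q_values, scores, beta=1.0):
    """Greedy action on the penalized objective Q(s, a) - beta * score(a)."""
    penalized = np.asarray(q_values, dtype=float) - beta * np.asarray(scores, dtype=float)
    return int(np.argmax(penalized))


def td_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Standard tabular Q-learning (TD) update, applied in place."""
    q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])


# Two actions with the same mean outcome but different predictability.
steady = [1.0, 1.0, 1.0, 1.0]
volatile = [5.0, -3.0, 5.0, -3.0]
scores = [uncertainty_score(steady), uncertainty_score(volatile)]

q_values = [1.0, 1.2]  # the volatile action looks slightly better on value alone
print(select_action(q_values, scores, beta=0.0))  # value only
print(select_action(q_values, scores, beta=0.1))  # value minus uncertainty penalty
```

With `beta=0.0` the choice depends on the value estimates alone, while a positive `beta` shifts preference toward the action whose outcomes are more predictable. How strongly to weight the penalty is a design choice that this sketch leaves open.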