Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints
This work addresses safety in reinforcement learning for applications where constraints depend on trajectory history. The contribution is incremental in that it extends existing safe RL methods to handle non-Markovian cases.
The paper tackles safe reinforcement learning with non-Markovian safety constraints, where safety labels are associated with trajectories rather than immediate states, and demonstrates that the approach is highly scalable and effectively satisfies these constraints.
In safe Reinforcement Learning (RL), safety cost is typically defined as a function of the immediate state and action. In practice, safety constraints are often non-Markovian because the state representation has insufficient fidelity, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are as follows. First, we design a safety model that performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance: we rewrite the constrained optimization problem into its dual problem and derive a gradient-based method to adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints.
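To make the last contribution concrete, the following is a minimal sketch of a gradient-based update of a tradeoff coefficient through a dual problem. It is a generic projected dual-gradient step and not necessarily the exact update derived in the paper. The Lagrangian form, the constraint "estimated safety probability >= threshold", and the names `dual_update`, `safety_estimate`, `threshold`, and `lr` are all assumptions made for illustration.

```python
def dual_update(lam, safety_estimate, threshold, lr=0.05):
    """One projected gradient step on the tradeoff coefficient `lam`.

    Assumed Lagrangian (illustrative, not taken from the paper):
        L(pi, lam) = J_reward(pi) + lam * (safety_estimate - threshold)
    with the constraint safety_estimate >= threshold and lam >= 0.
    The dual problem minimizes over lam the maximum of L over policies,
    so the gradient with respect to lam is (safety_estimate - threshold).
    """
    grad = safety_estimate - threshold
    # Gradient descent on lam, projected back onto lam >= 0.
    return max(0.0, lam - lr * grad)


# Hypothetical usage: the policy is currently less safe than required,
# so the coefficient grows and safety gets more weight in the next policy update.
lam = 1.0
lam_after_violation = dual_update(lam, safety_estimate=0.6, threshold=0.9)

# Once the constraint is satisfied with slack, the coefficient shrinks toward zero.
lam_after_slack = dual_update(0.001, safety_estimate=1.0, threshold=0.9)
```

In a training loop under these assumptions, `safety_estimate` would come from evaluating the learned safety model on trajectories sampled from the current policy, and the update would alternate with policy improvement steps.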