Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
For offline RL practitioners, this work offers a principled way to enforce safety constraints without environment interaction, addressing a key bottleneck in deploying RL in safety-critical domains.
The paper integrates safe policy improvement with shielding for offline RL, providing high-probability safety guarantees while improving average and worst-case performance over unshielded safe policy improvement, especially in low-data regimes.
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. Two major challenges are to guarantee (1) the performance and (2) the safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of which states are safe and which are unsafe. We then shield the policy improvement steps, which guarantees that the resulting policy is safe with high probability. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
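To make the idea of a dataset-only shield concrete, the following is a minimal sketch under strong simplifying assumptions, not the construction used in the paper. It assumes a tabular setting, a dataset of (state, action, next_state) triples, and a one-step notion of safety: an action is allowed only if a one-sided Hoeffding upper bound on its probability of entering an unsafe state stays below a threshold. The names `build_shield`, `shielded_argmax`, `delta`, and `threshold` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_shield(dataset, unsafe_states, delta=0.1, threshold=0.2):
    """Map each observed state to the set of actions considered safe.

    Assumption (illustrative only): safety is judged one step ahead, and
    each (state, action) pair gets its own confidence level 1 - delta.
    """
    visits = Counter()        # N(s, a): how often (s, a) appears in the data
    unsafe_hits = Counter()   # how often (s, a) led directly to an unsafe state
    actions_seen = defaultdict(set)
    for s, a, s_next in dataset:
        visits[(s, a)] += 1
        actions_seen[s].add(a)
        if s_next in unsafe_states:
            unsafe_hits[(s, a)] += 1

    shield = {}
    for s, actions in actions_seen.items():
        allowed = set()
        for a in actions:
            n = visits[(s, a)]
            p_hat = unsafe_hits[(s, a)] / n
            # One-sided Hoeffding bound: P(p > p_hat + eps) <= delta.
            eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
            if p_hat + eps <= threshold:
                allowed.add(a)
        shield[s] = allowed
    return shield


def shielded_argmax(q_values, shield, state):
    """Greedy improvement step restricted to the shielded action set.

    Returns None when no action is certified, so the caller can fall
    back to the baseline policy in that state.
    """
    allowed = shield.get(state, set())
    if not allowed:
        return None
    return max(allowed, key=lambda a: q_values[(state, a)])


# Toy data: action "a" never reaches the unsafe state 2, action "b" does
# half of the time, and action "c" is observed only once.
data = [(0, "a", 1)] * 50 + [(0, "b", 2)] * 25 + [(0, "b", 1)] * 25 + [(0, "c", 1)]
shield = build_shield(data, unsafe_states={2})
q = {(0, "a"): 1.0, (0, "b"): 5.0, (0, "c"): 9.0}
best = shielded_argmax(q, shield, 0)
```

In this toy dataset the rarely observed action "c" is excluded even though it was never seen to be unsafe, because a single sample gives a very wide confidence interval. This is the pessimism that makes a shield conservative when data is scarce. A guarantee that holds jointly over all state-action pairs would additionally need a union bound over the per-pair confidence levels, and a multi-step safety property would need reachability analysis on the estimated model rather than this one-step check.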