Finding Safe Zones of policies Markov Decision Processes
This addresses the problem of ensuring policy safety in reinforcement learning for practitioners, but it is incremental as it builds on existing MDP frameworks with a new computational approach.
The paper tackles the problem of finding SafeZones in Markov Decision Processes, which are subsets of states where most policy trajectories remain, and shows that optimal SafeZones are computationally hard to find, but provides a bi-criteria approximation algorithm with almost a factor of 2 for both escape probability and size using polynomial sample complexity.
Given a policy of a Markov Decision Process, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of a SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general, the problem is computationally hard. Our main result is a bi-criteria approximation learning algorithm with a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity.