Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective
This work addresses safety in reinforcement learning for domains such as robotics, though its contribution is incremental because it builds on existing inverse constraint learning methods.
The paper tackles the problem of learning safe policies from expert demonstrations in constrained Markov Decision Processes with unknown constraints, by developing SafeQIL, a safe Q-learning algorithm that balances reward maximization and safety. It shows competitive performance against state-of-the-art methods on benchmark tasks.
Given a set of trajectories demonstrating the safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories while balancing two objectives: being conservative, and significantly increasing the likelihood of high-reward trajectories that may contain unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. To this end, we formulate the "promise" of individual state-action pairs in terms of $Q$ values, which depend on task-specific rewards as well as on an assessment of the states' safety, mixing expectations in terms of rewards and safety. This entails a safe Q-learning perspective on the inverse learning problem under constraints. The devised Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared to state-of-the-art inverse constraint reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.
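To make the idea of $Q$ values that mix reward and safety more concrete, the following is a minimal tabular sketch. It is not the SafeQIL algorithm, whose update rule is not specified here. The safety score (1.0 for state-action pairs that appear in the demonstrations, a lower hypothetical value `unseen_safety` otherwise), the multiplicative weighting of the backup, and the names `safety_from_demos` and `safe_q_iteration` are all assumptions made for illustration. The weighting also assumes non-negative rewards.

```python
import numpy as np


def safety_from_demos(demos, n_states, n_actions, unseen_safety=0.5):
    """Hypothetical safety score: 1.0 for demonstrated (state, action) pairs,
    `unseen_safety` for pairs the expert never visited."""
    safety = np.full((n_states, n_actions), unseen_safety)
    for trajectory in demos:
        for state, action in trajectory:
            safety[state, action] = 1.0
    return safety


def safe_q_iteration(P, R, safety, gamma=0.9, n_iters=200):
    """Q-value iteration whose backup is scaled by the safety score.

    P: transition probabilities, shape (S, A, S).
    R: observable rewards, shape (S, A), assumed non-negative.
    safety: scores in [0, 1], shape (S, A).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_iters):
        V = Q.max(axis=1)                  # greedy state values, shape (S,)
        Q = safety * (R + gamma * P @ V)   # assumed reward/safety mixing
    return Q


# Toy problem: state 0 is the start, state 1 is absorbing with zero reward.
# Action 0 is the demonstrated route (reward 1.0); action 1 is an
# undemonstrated shortcut with a higher reward (1.5).
P = np.zeros((2, 2, 2))
P[:, :, 1] = 1.0                           # every action leads to state 1
R = np.array([[1.0, 1.5],
              [0.0, 0.0]])
demos = [[(0, 0)]]                         # the expert only ever takes action 0

safety = safety_from_demos(demos, n_states=2, n_actions=2)
Q_safe = safe_q_iteration(P, R, safety)
Q_plain = safe_q_iteration(P, R, np.ones_like(R))

print(Q_safe[0])    # [1.   0.75] -> greedy action 0 (demonstrated route)
print(Q_plain[0])   # [1.  1.5]   -> greedy action 1 (undemonstrated shortcut)
```

In this toy setting, raising `unseen_safety` towards 1.0 makes the greedy policy less conservative and eventually prefers the higher-reward shortcut, which mirrors the trade-off between conservativeness and reward described above.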