Safe Policy Improvement by Minimizing Robust Baseline Regret
This work addresses safe policy improvement for decision-making systems with limited data; it is incremental in that it builds on existing model-based methods.
The paper tackles the problem of computing a safe policy in sequential decision-making by minimizing robust baseline regret with an inaccurate dynamics model, and its empirical results show that the proposed approximate algorithm significantly outperforms standard approaches.
An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. In contrast to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to fall back seamlessly to the baseline policy otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.
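The idea of improving the baseline only where the dynamics are trustworthy can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a toy three-state problem, a finite list of plausible transition models standing in for the model's accuracy guarantees, and brute-force enumeration of deterministic policies. All names (`policy_return`, `robust_improvement`, `MODELS`, `REWARDS`, `BASELINE`) are hypothetical. For each candidate policy it computes the worst-case gain over the baseline across the plausible models and keeps the policy whose worst-case gain is largest.

```python
from itertools import product

GAMMA = 0.9  # discount factor (assumed for this toy example)


def policy_return(policy, model, rewards, start=0, iters=500):
    """Discounted return of a deterministic policy from `start` under one model.

    model[s][a] is a list of next-state probabilities; rewards[s][a] is a float.
    """
    n = len(policy)
    v = [0.0] * n
    for _ in range(iters):  # iterative policy evaluation
        v = [
            rewards[s][policy[s]]
            + GAMMA * sum(p * v[t] for t, p in enumerate(model[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def robust_improvement(models, rewards, baseline):
    """Return the deterministic policy with the largest worst-case gain
    over the baseline, together with that gain.

    The baseline itself has gain 0 under every model, so the result is
    never worse than the baseline for any model in `models`.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    base = [policy_return(baseline, m, rewards) for m in models]
    best_policy, best_gain = tuple(baseline), 0.0
    for policy in product(range(n_actions), repeat=n_states):
        gain = min(
            policy_return(policy, m, rewards) - b for m, b in zip(models, base)
        )
        if gain > best_gain:
            best_policy, best_gain = policy, gain
    return best_policy, best_gain


# Toy problem: state 2 is a zero-reward trap. Action 1 in state 0 has
# uncertain dynamics: one plausible model sends it to state 1, the other
# to the trap. Everything else is identical in both models.
REWARDS = [[0.0, 1.0], [1.0, 2.0], [0.0, 0.0]]
TO_1, TO_2 = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
MODEL_A = [[TO_1, TO_1], [TO_1, TO_1], [TO_2, TO_2]]
MODEL_B = [[TO_1, TO_2], [TO_1, TO_1], [TO_2, TO_2]]
MODELS = [MODEL_A, MODEL_B]
BASELINE = (0, 0, 0)

if __name__ == "__main__":
    print(robust_improvement(MODELS, REWARDS, BASELINE))
```

In this toy problem the selected policy keeps the baseline action in state 0, where the two models disagree, and switches to the higher-reward action in state 1, where they agree. Enumeration is exponential in the number of states and ignores randomized policies, so it serves only to make the objective concrete.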