PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data
This work addresses the need for reliable uncertainty estimates in high-stakes applications such as healthcare, though its contribution is incremental: it builds on existing OPE and data augmentation techniques.
The paper tackles unreliable uncertainty quantification in off-policy evaluation (OPE) with biased auxiliary datasets and proposes two methods for constructing valid confidence intervals for policy value estimates. Unlike prior approaches, the resulting intervals consistently cover the ground-truth values across simulators and a real healthcare dataset.
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state, $V^\pi(s_0)$; such intervals are particularly important for human-centered applications. To do so, we introduce a new conformal prediction method for MDPs with high-dimensional states. Second, we consider the more common task of estimating the average policy performance over many initial states; for this, we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground-truth values, unlike previously proposed methods.
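To make the second idea concrete, the following is a minimal sketch of the generic prediction-powered mean estimator with a normal-approximation confidence interval. It is not the paper's estimator; it only illustrates how residuals measured on real data can correct a biased auxiliary model. The function name `ppi_mean_interval` and the inputs `y_real` (value estimates from real data for n initial states), `f_real` (auxiliary-model predictions for the same states), and `f_aux` (auxiliary-model predictions for N additional states) are hypothetical names introduced for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_interval(y_real, f_real, f_aux, alpha=0.05):
    """Prediction-powered estimate of a mean with a (1 - alpha) normal CI.

    All argument names are illustrative assumptions, not the paper's notation.
    y_real: outcomes observed on real data (length n)
    f_real: auxiliary-model predictions for those same n points
    f_aux:  auxiliary-model predictions on N extra points without real outcomes
    """
    if len(y_real) != len(f_real):
        raise ValueError("y_real and f_real must have the same length")
    if len(y_real) < 2 or len(f_aux) < 2:
        raise ValueError("need at least two points in each sample")

    n, big_n = len(y_real), len(f_aux)

    # Residuals measure the auxiliary model's bias on the real data.
    residuals = [y - f for y, f in zip(y_real, f_real)]

    # Auxiliary-only mean plus the bias correction estimated from real data.
    estimate = fmean(f_aux) + fmean(residuals)

    # The two samples are treated as independent, so their variances add.
    std_err = sqrt(variance(f_aux) / big_n + variance(residuals) / n)

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)
```

If the auxiliary predictions are biased, the residual term shifts the estimate back toward the real-data mean, and the interval widens with the residual variance. The sketch assumes independent samples and sample sizes large enough for the normal approximation to be reasonable.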