Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
This work addresses the prohibitive cost of environment interactions that reinforcement learning practitioners face during policy evaluation, offering an incremental improvement in the data efficiency of online Monte Carlo estimation.
The paper tackles inefficient policy evaluation in reinforcement learning by proposing methods that improve the data efficiency of online Monte Carlo estimators while keeping them unbiased. Compared with previous works, the methods achieve better empirical performance with fewer requirements for offline data.
Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
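To make the idea of a variance-reducing behavior policy concrete, here is a minimal sketch for a one-step (bandit) problem. It is not the closed-form policy proposed in the paper, which the abstract does not spell out. It assumes the estimator is the ordinary importance-sampling estimator, and it uses the standard one-step result that sampling actions in proportion to pi(a) * sqrt(E[r^2 | a]) minimizes the variance of that estimator while keeping it unbiased. All names and numbers (`pi`, `mean`, `std`, `variance_reducing_behavior`) are hypothetical.

```python
import math
import random


def true_value(pi, mean):
    """Expected reward of the target policy pi."""
    return sum(p * m for p, m in zip(pi, mean))


def variance_reducing_behavior(pi, second_moment):
    """Behavior policy mu(a) proportional to pi(a) * sqrt(E[r^2 | a]).

    Standard one-step result, used here only as an illustration.
    """
    weights = [p * math.sqrt(q) for p, q in zip(pi, second_moment)]
    total = sum(weights)
    return [w / total for w in weights]


def per_sample_variance(pi, mu, mean, second_moment):
    """Exact variance of one importance-weighted reward pi(a)/mu(a) * r, a ~ mu."""
    j = true_value(pi, mean)
    return sum(p * p * q / m for p, m, q in zip(pi, mu, second_moment)) - j * j


def is_estimate(pi, mu, mean, std, n, rng):
    """Importance-sampling Monte Carlo estimate from n samples drawn under mu.

    Unbiased as long as mu(a) > 0 wherever pi(a) > 0.
    With mu == pi this is the plain on-policy Monte Carlo estimator.
    """
    actions = list(range(len(pi)))
    total = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=mu)[0]
        r = rng.gauss(mean[a], std[a])
        total += pi[a] / mu[a] * r
    return total / n


# Hypothetical three-action problem: a rare action carries a large reward.
pi = [0.7, 0.2, 0.1]
mean = [1.0, 0.0, 10.0]
std = [0.5, 0.5, 2.0]
second_moment = [m * m + s * s for m, s in zip(mean, std)]

mu = variance_reducing_behavior(pi, second_moment)
var_on_policy = per_sample_variance(pi, pi, mean, second_moment)
var_behavior = per_sample_variance(pi, mu, mean, second_moment)

rng = random.Random(0)
est_on_policy = is_estimate(pi, pi, mean, std, 20000, rng)
est_behavior = is_estimate(pi, mu, mean, std, 20000, rng)
```

In this sketch the second moments are known exactly. In the setting the abstract describes, the corresponding quantities would have to be learned from offline data, and the learning error would determine how much of the variance reduction survives.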