High-Confidence Off-Policy (or Counterfactual) Variance Estimation
This work addresses the need for high-confidence guarantees on the variance of a policy's returns in sequential decision-making systems, particularly in high-risk domains. It may appear incremental, since it extends prior work on high-confidence estimation of the expected return to estimation of the variance.
The paper tackles the previously open problem of estimating and bounding, with high confidence, the variance of returns from off-policy data, which is critical for high-risk applications.
Many sequential decision-making systems leverage data collected using prior policies to propose a new policy. For critical applications, it is important that high-confidence guarantees on the new policy's behavior are provided before deployment, to ensure that the policy will behave as desired. Prior works have studied high-confidence off-policy estimation of the expected return; however, high-confidence off-policy estimation of the variance of returns can be equally critical for high-risk applications. In this paper, we tackle the previously open problem of estimating and bounding, with high confidence, the variance of returns from off-policy data.
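
To make the quantity concrete, the following is a minimal sketch of a plug-in point estimate of the variance of returns from off-policy data. It is not the paper's estimator. It assumes the standard identity Var(G) = E[G^2] - (E[G])^2 and per-trajectory importance ratios, and the names `off_policy_variance`, `returns`, and `weights` are hypothetical. It yields only a point estimate, which is biased and can be negative on small samples, and it provides no high-confidence bound.

```python
import numpy as np


def off_policy_variance(returns, weights):
    """Plug-in importance-sampling estimate of the variance of returns.

    Assumptions (not taken from the abstract):
      returns: per-trajectory returns G_i collected under the prior policy.
      weights: per-trajectory importance ratios rho_i, i.e. the probability
               of trajectory i under the new policy divided by its
               probability under the prior policy.
    """
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Importance-weighted estimate of E[G] under the new policy.
    first_moment = np.mean(weights * returns)
    # Importance-weighted estimate of E[G^2] under the new policy.
    second_moment = np.mean(weights * returns ** 2)

    # Var(G) = E[G^2] - (E[G])^2, with both moments replaced by estimates.
    return second_moment - first_moment ** 2
```

When every importance ratio equals 1 (the new policy matches the prior policy), this reduces to the ordinary population variance of the observed returns.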