ML LG | Sep 23, 2023

Distributional Shift-Aware Off-Policy Interval Estimation: A Unified Error Quantification Framework

Wenzhuo Zhou, Yuhan Li, Ruoqing Zhu, Annie Qu

arXiv:2309.13278v2 | 10.87 citations | h-index: 16

Originality: Incremental advance

AI Analysis

This work provides a robust method for high-confidence policy evaluation in offline reinforcement learning. The advance is incremental, but it addresses key challenges in error trade-offs and distributional shift.

The paper tackles off-policy evaluation in Markov decision processes by developing a confidence interval estimator that addresses distributional shift and error quantification, achieving tight intervals and robustness in synthetic and real-world health data.

We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing a comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by an innovative unified error analysis, we jointly quantify the two sources of estimation errors: the misspecification error on modeling marginalized importance weights and the statistical uncertainty due to sampling, within a single interval. This unified framework reveals a previously hidden tradeoff between the errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. Our method is applicable to time-dependent data without assuming any weak dependence conditions via leveraging a local supermartingale/martingale structure. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined in synthetic datasets and an OhioT1DM mobile health study.
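To make the setting concrete, here is a minimal Python sketch of a confidence interval built from marginalized importance weights. It is not the paper's estimator: it assumes the weights are already known and correctly specified, that samples are i.i.d., and that rewards and weights are bounded, so the Hoeffding interval captures sampling uncertainty only. It has no misspecification term, no discriminator function, and no handling of dependent data. The function name `weighted_value_interval` and the parameters `r_max` and `w_max` are hypothetical, introduced only for illustration.

```python
import math


def weighted_value_interval(rewards, weights, gamma, r_max, w_max, delta=0.05):
    """Naive importance-weighted value estimate with a Hoeffding interval.

    Illustrative assumptions (not from the paper):
      - weights[i] approximates the ratio of the target policy's discounted
        state-action occupancy to the offline data distribution,
      - samples are i.i.d., rewards lie in [0, r_max], weights in [0, w_max].
    Returns (estimate, lower, upper) for the discounted policy value.
    """
    n = len(rewards)
    if n == 0 or n != len(weights):
        raise ValueError("rewards and weights must be non-empty and equal length")
    if any(r < 0 or r > r_max for r in rewards):
        raise ValueError("rewards must lie in [0, r_max]")
    if any(w < 0 or w > w_max for w in weights):
        raise ValueError("weights must lie in [0, w_max]")

    # Weighted average reward under the (assumed) occupancy ratio.
    mean_term = sum(w * r for w, r in zip(weights, rewards)) / n

    # Two-sided Hoeffding half-width for terms bounded in [0, w_max * r_max].
    half_width = w_max * r_max * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    # Convert the normalized (per-step) value to the discounted value.
    scale = 1.0 / (1.0 - gamma)
    estimate = scale * mean_term
    return estimate, estimate - scale * half_width, estimate + scale * half_width


# Toy single-state example: the behavior policy picks each of two actions
# half the time, the target policy picks action 1 with probability 0.9,
# so the weights are 0.2 (action 0) and 1.8 (action 1).
rewards = [0.0, 1.0] * 500
weights = [0.2, 1.8] * 500
est, lo, hi = weighted_value_interval(rewards, weights, gamma=0.9, r_max=1.0, w_max=1.8)
print(f"estimate={est:.3f}, interval=[{lo:.3f}, {hi:.3f}]")
```

The half-width here grows linearly with `w_max`, which is one simple way to see why a large mismatch between the target policy's distribution and the offline data widens a naive interval.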
