ML · LG · Feb 2, 2024

Distributional Off-policy Evaluation with Bellman Residual Minimization

Sungee Hong, Zhengling Qi, Raymond K. W. Wong

arXiv:2402.01900v39.22 citations · h-index: 1 · Has Code

Originality: Highly original

AI Analysis

This work addresses a theoretical bottleneck in offline reinforcement learning by offering a more tractable way to learn return distributions, though the contribution is incremental in that it builds on existing distributional OPE methods.

The paper tackles distributional off-policy evaluation by proposing a new method, the Energy Bellman Residual Minimizer (EBRM). EBRM is built on expectation-extended statistical distances, which are easier to estimate than the supremum-based distances used in prior work. The paper also provides finite-sample error bounds that do not require the completeness assumption.
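The contrast between the two kinds of distance can be written out as follows. This is a sketch in assumed notation, not taken from the paper: $\eta$ is a base statistical distance between probability distributions, $\Upsilon_1(s,a)$ and $\Upsilon_2(s,a)$ are two candidate return distributions conditional on a state-action pair, and $\rho$ is a reference distribution over state-action pairs (for example, the one that generated the offline data).

```latex
% Supremum-extended distance: worst case over all state-action pairs
\eta_{\infty}(\Upsilon_1, \Upsilon_2)
  = \sup_{(s,a)} \eta\bigl(\Upsilon_1(s,a),\, \Upsilon_2(s,a)\bigr)

% Expectation-extended distance: average over state-action pairs drawn from \rho
\bar{\eta}(\Upsilon_1, \Upsilon_2)
  = \mathbb{E}_{(S,A)\sim\rho}\,\eta\bigl(\Upsilon_1(S,A),\, \Upsilon_2(S,A)\bigr)

% The average never exceeds the worst case
\bar{\eta}(\Upsilon_1, \Upsilon_2) \le \eta_{\infty}(\Upsilon_1, \Upsilon_2)
```

The expectation can be approximated by averaging over observed state-action pairs, whereas the supremum ranges over pairs that may be rare or absent in the data.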

We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of much existing work relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
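To make the distinction concrete, here is a minimal Python sketch. It is not the EBRM estimator from the paper. It assumes the "energy" in EBRM refers to the standard energy distance, and it assumes return samples are available for a handful of state-action pairs. The function names `energy_distance` and `extended_distances`, the dictionary layout, and the weights are all hypothetical choices made for illustration.

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in estimate of the squared energy distance
    2 E|X - Y| - E|X - X'| - E|Y - Y'| between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xy = np.abs(x[:, None] - y[None, :]).mean()  # cross term E|X - Y|
    xx = np.abs(x[:, None] - x[None, :]).mean()  # within-sample term for x
    yy = np.abs(y[:, None] - y[None, :]).mean()  # within-sample term for y
    return 2.0 * xy - xx - yy

def extended_distances(samples_p, samples_q, weights):
    """Aggregate per-state-action distances in two ways.

    samples_p, samples_q: dict mapping a state-action key to return samples.
    weights: dict mapping the same keys to (unnormalized) probabilities.
    Returns (expectation_extended, supremum_extended).
    """
    keys = list(weights)
    d = np.array([energy_distance(samples_p[k], samples_q[k]) for k in keys])
    w = np.array([weights[k] for k in keys], dtype=float)
    w = w / w.sum()  # normalize so the weights form a distribution
    return float(np.dot(w, d)), float(d.max())

# Toy data (made up): the two models agree on pair "a" and differ on pair "b".
samples_p = {"a": [0.0, 1.0], "b": [0.0]}
samples_q = {"a": [0.0, 1.0], "b": [1.0]}
weights = {"a": 0.75, "b": 0.25}

expected, worst = extended_distances(samples_p, samples_q, weights)
print(expected, worst)
```

In this toy case the rarely visited pair "b" dominates the supremum-extended value, while the expectation-extended value discounts it by its weight. The same averaging is what makes the expectation version estimable from sampled state-action pairs.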
