ML · LG · Feb 2, 2024

Distributional Off-policy Evaluation with Bellman Residual Minimization

arXiv:2402.01900v32 citations · h-index: 1
Originality: Highly original
AI Analysis

This work addresses a theoretical bottleneck in offline reinforcement learning by offering a more manageable approach to learning return distributions. The contribution is incremental in that it builds on existing distributional OPE methods.

The paper tackles distributional off-policy evaluation by proposing the Energy Bellman Residual Minimizer (EBRM), a method based on expectation-extended statistical distances, which are easier to estimate than the supremum-based distances used in prior work. It provides finite-sample error bounds without requiring the completeness assumption.

We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
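
The contrast between supremum-extended and expectation-extended distances can be illustrated with a small sketch. This is not the paper's EBRM estimator. It only computes a standard V-statistic estimate of the squared energy distance between two one-dimensional samples, then aggregates the per-state-action distances either by a weighted average or by a maximum. The function names, the toy Gaussian "return" samples, and the weights `visit_weights` are assumptions made for illustration.

```python
import numpy as np


def energy_distance_sq(x, y):
    """V-statistic estimate of the squared energy distance between 1-D samples.

    Uses 2*E|X - Y| - E|X - X'| - E|Y - Y'| with empirical means.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cross = np.abs(x[:, None] - y[None, :]).mean()
    within_x = np.abs(x[:, None] - x[None, :]).mean()
    within_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * cross - within_x - within_y


def expectation_extended(dists, weights):
    """Weighted average of per-(state, action) distances (weights are normalized)."""
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights / weights.sum(), dists))


def supremum_extended(dists):
    """Worst-case (maximum) per-(state, action) distance."""
    return float(np.max(dists))


# Toy data (assumption): three hypothetical state-action pairs, each with
# samples from an estimated and a reference return distribution.
rng = np.random.default_rng(0)
estimated = [rng.normal(m, 1.0, 200) for m in (0.0, 1.0, 2.0)]
reference = [rng.normal(m, 1.0, 200) for m in (0.0, 1.0, 4.0)]
visit_weights = [0.6, 0.3, 0.1]  # hypothetical weighting over the three pairs

per_pair = [energy_distance_sq(e, r) for e, r in zip(estimated, reference)]
print("expectation-extended:", expectation_extended(per_pair, visit_weights))
print("supremum-extended:   ", supremum_extended(per_pair))
```

The averaged quantity only needs samples from wherever the weighting puts mass, whereas the maximum depends on the single worst state-action pair, including pairs with few samples. That is one informal way to read the abstract's claim that supremum-extended distances are harder to estimate.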

Code Implementations: 1 repo
