ML · LG · ST · Sep 21, 2022

Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

arXiv:2209.10064v2 · 15 citations · h-index: 14
Originality: Highly original
AI Analysis

This paper addresses the problem of evaluating policies in complex, partially observable environments, which is of interest to reinforcement learning researchers. It is incremental in that it builds on the existing proximal causal inference framework.

The paper tackles off-policy evaluation in episodic partially observable Markov decision processes with continuous states. It develops a non-parametric identification method based on V-bridge functions and time-dependent proxy variables, and derives what the authors describe as the first finite-sample error bounds for this problem under non-parametric models.

We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV problem, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
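The abstract says that a non-parametric instrumental variable (NPIV) problem is solved at each step of the recursion. As a rough illustration of what a single such step can look like, the sketch below solves a generic conditional moment equation E[Y - h(W) | Z] = 0 by sieve two-stage least squares on toy data. It is not the authors' estimator: the polynomial basis, the ridge term, the function names `poly_basis` and `npiv_2sls`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def poly_basis(x, degree):
    """Polynomial sieve basis [1, x, ..., x^degree] for a 1-D array x."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)


def npiv_2sls(y, phi_w, psi_z, ridge=1e-8):
    """Sieve 2SLS for E[y - h(W) | Z] = 0 with h(w) = phi(w) @ beta.

    phi_w: basis evaluated at the endogenous variable W, shape (n, p).
    psi_z: basis evaluated at the instrument Z, shape (n, k), with k >= p.
    ridge: small regulariser (an assumption here) to stabilise the solves.
    """
    # Stage 1: project each column of phi_w onto the span of psi_z.
    gram_z = psi_z.T @ psi_z + ridge * np.eye(psi_z.shape[1])
    phi_hat = psi_z @ np.linalg.solve(gram_z, psi_z.T @ phi_w)
    # Stage 2: least squares of y on the projected basis.
    gram = phi_hat.T @ phi_hat + ridge * np.eye(phi_w.shape[1])
    return np.linalg.solve(gram, phi_hat.T @ y)


# Toy data (hypothetical): u is an unobserved confounder, z an instrument.
rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)
z = rng.normal(size=n)
w = z + u + 0.5 * rng.normal(size=n)           # W depends on the confounder
y = 1.0 + 2.0 * w - u + 0.5 * rng.normal(size=n)  # true h(w) = 1 + 2w

# Linear sieve in both W and Z; the true coefficients are (1.0, 2.0).
beta = npiv_2sls(y, poly_basis(w, 1), poly_basis(z, 1))

# Ordinary least squares ignores the confounder and is biased here.
beta_ols = np.linalg.lstsq(poly_basis(w, 1), y, rcond=None)[0]
print("2SLS:", beta, "OLS:", beta_ols)
```

In a recursive scheme of the kind the abstract describes, a solve like `npiv_2sls` would be repeated backward over the horizon, with the response at each step built from the reward and the previously estimated function. The exact form of that response and the choice of proxies follow the paper and are not reproduced here.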

