LG AI SY ML

Sep 9, 2019

Off-Policy Evaluation in Partially Observable Environments

Guy Tennenholtz, Shie Mannor, Uri Shalit

arXiv:1909.03739v3 22.192 citations

Originality: Highly original

AI Analysis

The paper addresses a critical bias issue in reinforcement learning for domains such as healthcare, where partial observability is common, and represents a foundational advance rather than an incremental step.

This paper tackles batch off-policy evaluation in partially observable environments. It establishes what the authors believe is the first off-policy evaluation result for POMDPs and proposes a Decoupled POMDP model that mitigates estimation errors, with demonstrations on synthetic medical data.

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
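To make the pitfall concrete, here is a minimal sketch of trajectory-wise Importance Sampling on a toy one-step problem. It is not the paper's experiment or its proposed estimator. The environment, the probabilities, and every name (`importance_sampling`, `build_dataset`, `oracle_behavior`, `marginal_behavior`) are assumptions made up for illustration. A hidden binary state `u` drives both the behavior policy and the reward, and the evaluation policy always picks action 1. Weighting by the action probability that can be estimated without `u` gives a biased value, while the oracle weights that use `u` recover the true one.

```python
from typing import Callable, List, Tuple

# One step is (hidden_state, action, reward). The hidden state is stored only
# so the oracle weights can be compared against the naive ones.
Step = Tuple[int, int, float]
Trajectory = List[Step]


def importance_sampling(
    trajectories: List[Trajectory],
    eval_prob: Callable[[int], float],
    behavior_prob: Callable[[int, int], float],
) -> float:
    """Average of (product of per-step probability ratios) * (sum of rewards)."""
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for u, a, r in traj:
            weight *= eval_prob(a) / behavior_prob(u, a)
            ret += r
        total += weight * ret
    return total / len(trajectories)


def build_dataset() -> List[Trajectory]:
    """100 one-step episodes at exact population frequencies (assumed toy setup).

    P(u = 1) = 0.5, the behavior policy picks a == u with probability 0.9,
    and the reward is 1 when a == u, else 0.
    """
    counts = {(1, 1): 45, (1, 0): 5, (0, 0): 45, (0, 1): 5}
    data: List[Trajectory] = []
    for (u, a), n in counts.items():
        reward = 1.0 if a == u else 0.0
        data.extend([[(u, a, reward)] for _ in range(n)])
    return data


def eval_policy(a: int) -> float:
    # Evaluation policy: always take action 1.
    return 1.0 if a == 1 else 0.0


def oracle_behavior(u: int, a: int) -> float:
    # True behavior probability, which requires the hidden state.
    return 0.9 if a == u else 0.1


def marginal_behavior(u: int, a: int) -> float:
    # What can be estimated without u: P(a = 1) = 0.5 * 0.9 + 0.5 * 0.1 = 0.5.
    return 0.5


data = build_dataset()
true_value = 0.5  # always choosing a = 1 is rewarded exactly when u = 1
naive = importance_sampling(data, eval_policy, marginal_behavior)
oracle = importance_sampling(data, eval_policy, oracle_behavior)
print(f"true={true_value:.2f} naive={naive:.2f} oracle={oracle:.2f}")
```

In this toy setup the naive estimate comes out at about 0.9 against a true value of 0.5, because the logged action carries information about the hidden state that the weights ignore. The oracle variant is shown only for contrast, since in a batch setting the hidden state is not available.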
