LG · AI · SY · Apr 5, 2023

Conformal Off-Policy Evaluation in Markov Decision Processes

arXiv:2304.02574v2 · 11 citations · h-index: 48
AI Analysis

This work addresses the need for reliable off-policy evaluation in applications where experimentation is costly or unethical, offering tighter reward intervals backed by formal guarantees.

The paper tackles the problem of evaluating reinforcement learning policies from historical data, without online experimentation. It presents a conformal prediction method that outputs an interval containing the target policy's true reward with a prescribed level of certainty. Some variants of the method yield shorter intervals than existing approaches while maintaining the same certainty level.

Reinforcement Learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.
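To make the idea of a conformal interval under distribution shift concrete, here is a minimal sketch of weighted split conformal prediction, a standard way to correct for a shift between calibration and test distributions. It is not necessarily the paper's algorithm. The function name, its arguments, and the use of absolute residuals as conformity scores are assumptions made for illustration; the weights stand in for likelihood ratios between the target and behavior policies, which are assumed to be known here.

```python
import numpy as np

def weighted_conformal_interval(cal_scores, cal_weights, test_weight,
                                point_estimate, alpha=0.1):
    """Hypothetical helper: weighted split-conformal interval.

    cal_scores     : absolute residuals |observed return - predicted return|
                     on calibration data gathered under the behavior policy
    cal_weights    : assumed likelihood ratios (target / behavior), one per score
    test_weight    : assumed likelihood ratio for the test point
    point_estimate : predicted return of the target policy at the test point
    alpha          : miscoverage level, so the target coverage is 1 - alpha
    """
    scores = np.asarray(cal_scores, dtype=float)
    weights = np.asarray(cal_weights, dtype=float)

    # Normalize over the calibration points plus the test point.
    # The test point's share of the mass is placed at +infinity.
    total = weights.sum() + test_weight
    probs = weights / total

    # Cumulative weight of the calibration scores in increasing order.
    order = np.argsort(scores)
    cumulative = np.cumsum(probs[order])

    # Smallest score whose cumulative weight reaches 1 - alpha.
    # If the calibration mass never reaches it, the quantile is +infinity.
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    q = scores[order][idx] if idx < len(scores) else np.inf

    return point_estimate - q, point_estimate + q


# With uniform weights this reduces to ordinary split conformal prediction.
uniform = weighted_conformal_interval([1, 2, 3], [1, 1, 1], 1.0, 10.0, alpha=0.5)
# Putting more weight on the largest residual widens the interval.
shifted = weighted_conformal_interval([1, 2, 3], [1, 1, 6], 2.0, 10.0, alpha=0.5)
```

In this toy call, the calibration point with the largest residual is the one the target policy visits most often, so the weighted quantile moves up and the interval widens. When the weights are too concentrated for the calibration mass to reach 1 - alpha, the sketch returns an infinite interval rather than an invalid one.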

