AISep 12, 2014

On Minimax Optimal Offline Policy Evaluation

arXiv:1409.3653v116 citations
Originality Highly original
AI Analysis

This addresses the problem of accurately evaluating policies from observational data for researchers and practitioners in reinforcement learning, providing theoretical guarantees for estimator performance.

The paper tackles the off-policy evaluation problem by establishing a minimax risk lower bound for multi-armed bandits, showing that one standard estimator is minimax optimal up to a constant while another can be arbitrarily worse, with results extended to contextual bandits and Markov decision processes.

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, while another can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also related to semi-supervised learning.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes