LG · ML · Oct 22, 2020

CoinDICE: Off-Policy Confidence Interval Estimation

arXiv:2010.11652v1 · 92 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for reliable off-policy evaluation in reinforcement learning, particularly in safety-critical applications. The advance is incremental: it builds on existing methods and improves the accuracy and tightness of the resulting confidence intervals.

The paper tackles the problem of estimating confidence intervals for a target policy's value in reinforcement learning using only a static dataset from unknown behavior policies, and proposes CoinDICE, which provides tighter and more accurate intervals than existing methods in benchmarks.

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
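The linear program formulation mentioned in the abstract rests on a primal-dual identity: the normalized policy value can be read off either from the $Q$-function or from the discounted state-action occupancy of the target policy. The following is a minimal tabular sketch of that identity only, not of CoinDICE itself. It assumes the transition model and rewards are known, whereas the paper works from a static dataset. The function name `policy_value_two_ways` and the random MDP are hypothetical and chosen purely for illustration.

```python
import numpy as np


def policy_value_two_ways(P, R, pi, mu0, gamma):
    """Compute the normalized value of policy `pi` in a known tabular MDP.

    P:   (S, A, S) transition probabilities P[s, a, s']
    R:   (S, A)    expected rewards
    pi:  (S, A)    target policy pi[s, a]
    mu0: (S,)      initial state distribution
    Returns (value from Q, value from occupancy d, occupancy d).
    """
    S, A = R.shape
    n = S * A
    # Transition operator on state-action pairs:
    # P_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a']
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
    r = R.reshape(n)
    init = (mu0[:, None] * pi).reshape(n)  # initial state-action distribution

    # Primal side: Q solves the Bellman equation Q = R + gamma * P_pi Q.
    Q = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    value_q = (1.0 - gamma) * init @ Q

    # Dual side: d solves d = (1 - gamma) * init + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n) - gamma * P_pi.T, (1.0 - gamma) * init)
    value_d = d @ r
    return value_q, value_d, d


# Hypothetical random MDP, used only to exercise the identity.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] sums to 1
R = rng.random((S, A))              # rewards in [0, 1)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)  # each pi[s, :] sums to 1
mu0 = np.full(S, 1.0 / S)

value_q, value_d, d = policy_value_two_ways(P, R, pi, mu0, gamma)
print(value_q, value_d)
```

Both numbers agree up to floating-point error, and `d` is a probability distribution over state-action pairs. In the off-policy setting described above, neither `P` nor the behavior policy is available, so these constraints would have to be estimated from samples; quantifying the resulting uncertainty is what the confidence interval is for.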

