LG, ST, ML
Feb 18, 2021

Off-policy Confidence Sequences

arXiv:2102.09540v1
19 citations
AI Analysis

This work addresses the reliable and safe deployment of contextual bandit systems, contributing incremental improvements to confidence sequence methods.

The paper tackles off-policy evaluation in contextual bandits by developing confidence bounds that hold uniformly over time, derived from martingale analysis. It demonstrates empirically that the bounds are tight in both failure probability and width, and applies them to safely upgrading a production contextual bandit system.

We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the "gated deployment" problem of safely upgrading a production contextual bandit system.
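To make the idea of a time-uniform, martingale-based bound concrete, here is a minimal sketch of a lower confidence sequence for an off-policy value estimate. It is not the paper's algorithm. It assumes rewards in [0, 1], importance weights w = pi(a|x) / h(a|x) >= 0 with known logging probabilities, and a fixed bet size `lam` (the paper's methods choose bets more carefully). The names `log_wealth`, `lower_bound`, and `lam` are hypothetical. For a candidate value v, the product of the factors 1 + lam * (w * r - v) is a nonnegative martingale when v is the true value, so by Ville's inequality it exceeds 1/alpha at any time with probability at most alpha. Every candidate whose wealth reaches 1/alpha can therefore be rejected, at any stopping time.

```python
import math
import random


def log_wealth(xs, v, lam=0.5):
    """Log of the betting martingale prod(1 + lam * (x - v)) for candidate value v.

    xs holds importance-weighted rewards x = w * r, assumed >= 0.
    With 0 <= lam <= 1 and 0 <= v <= 1, every factor is >= 1 - lam * v >= 0.
    """
    return sum(math.log1p(lam * (x - v)) for x in xs)


def lower_bound(xs, alpha=0.05, lam=0.5, tol=1e-6):
    """Largest v (up to tol) rejected by Ville's inequality, clipped to [0, 1].

    Wealth is decreasing in v, so the rejected candidates form an interval
    [0, L) and L can be located by bisection.
    """
    threshold = math.log(1.0 / alpha)
    if log_wealth(xs, 0.0, lam) < threshold:
        return 0.0  # nothing can be rejected yet
    if log_wealth(xs, 1.0, lam) >= threshold:
        return 1.0  # every candidate in [0, 1) is rejected
    lo, hi = 0.0, 1.0  # invariant: lo is rejected, hi is not
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_wealth(xs, mid, lam) >= threshold:
            lo = mid
        else:
            hi = mid
    return lo  # conservative end of the bracket


if __name__ == "__main__":
    # Hypothetical two-action problem: uniform logging policy h,
    # target policy pi plays action 1 with probability 0.9.
    rng = random.Random(0)
    pi = {0: 0.1, 1: 0.9}
    mean_reward = {0: 0.2, 1: 0.8}  # true value of pi is 0.74
    xs = []
    for t in range(1, 2001):
        a = rng.randrange(2)  # logged action, h(a) = 0.5
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        xs.append(pi[a] / 0.5 * r)  # importance-weighted reward
        if t % 500 == 0:
            print(t, round(lower_bound(xs), 3))
```

Because the guarantee holds simultaneously for all t, the bound can be checked after every observation, and a running maximum of the bounds is also valid. A gated deployment rule along the lines the abstract describes could, under these assumptions, promote the new policy as soon as the lower bound exceeds the value of the incumbent policy.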

