LG · ML · Apr 18, 2021

Off-Policy Risk Assessment in Contextual Bandits

arXiv:2104.08977v2 · 44 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for off-policy risk assessment in contextual bandits, letting practitioners evaluate policies under a range of risk measures. It is an incremental advance, since it builds on existing off-policy evaluation methods.

The paper tackles the problem of evaluating risk objectives beyond expected reward in contextual bandits from logged data. It introduces the OPRA framework, which provides finite-sample guarantees and error bounds on its risk estimates that converge at a rate of $O(1/\sqrt{n})$.

Even when unable to run experiments, practitioners can evaluate prospective policies using previously logged data. However, while the bandits literature has adopted a diverse set of objectives, most research on off-policy evaluation to date focuses on the expected reward. In this paper, we introduce Lipschitz risk functionals, a broad class of objectives that subsumes conditional value-at-risk (CVaR), variance, mean-variance, many distorted risks, and CPT risks, among others. We propose Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy's CDF and then generates plug-in estimates for any collection of Lipschitz risks, providing finite-sample guarantees that hold simultaneously over the entire class. We instantiate OPRA with both importance sampling and doubly robust estimators. Our primary theoretical contributions are (i) the first uniform concentration inequalities for both CDF estimators in contextual bandits and (ii) error bounds on our Lipschitz risk estimates, which all converge at a rate of $O(1/\sqrt{n})$.
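
The two-step recipe in the abstract (estimate the target policy's reward CDF, then plug it into each risk functional) can be sketched roughly as follows. This is a minimal illustration of the importance-sampling variant only, not the authors' implementation. The function names `is_cdf` and `plugin_risks` are hypothetical, the weights are assumed to be precomputed ratios of target to logging action probabilities, the clipping of the estimated CDF to [0, 1] is a choice made here so that the estimate is a valid CDF, and CVaR is taken in the lower-tail convention for rewards (the mean of the worst `alpha` fraction of outcomes).

```python
import numpy as np


def is_cdf(rewards, weights):
    """Importance-sampling estimate of the target policy's reward CDF.

    rewards: logged rewards r_i.
    weights: importance weights w_i = pi(a_i | x_i) / beta(a_i | x_i)
             (assumed precomputed and non-negative).
    Returns the sorted rewards t_1 <= ... <= t_n and the estimate at each t_k.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(rewards, kind="stable")
    t = rewards[order]
    # F_hat(t_k) = (1/n) * sum_i w_i * 1{r_i <= t_k}
    cdf = np.cumsum(weights[order]) / len(rewards)
    # Assumption for this sketch: clip to [0, 1] and pin the last value to 1
    # so the estimate is a valid CDF.
    cdf = np.clip(cdf, 0.0, 1.0)
    cdf[-1] = 1.0
    return t, cdf


def plugin_risks(t, cdf, alpha=0.1):
    """Plug-in estimates of several risks from one estimated CDF."""
    # Point masses implied by the step-function CDF.
    pmf = np.diff(cdf, prepend=0.0)
    mean = float(np.sum(t * pmf))
    variance = float(np.sum(t**2 * pmf) - mean**2)
    # Lower-tail CVaR: average of the quantile function over (0, alpha].
    tail_mass = np.diff(np.minimum(cdf, alpha), prepend=0.0)
    cvar = float(np.sum(t * tail_mass) / alpha)
    return {"mean": mean, "variance": variance, "cvar": cvar}
```

Because every risk is computed from the same estimated CDF, adding another functional only requires another line in `plugin_risks`. The doubly robust variant mentioned in the abstract would replace `is_cdf` and is not shown here.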

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
