LGMLFeb 26, 2021

Adapting to Misspecification in Contextual Bandits with Offline Regression Oracles

arXiv:2102.13240v228 citations
Originality Incremental advance
AI Analysis

This addresses robustness issues in contextual bandits for applications like online advertising or recommendation systems, but it is incremental as it builds on prior work to simplify implementation.

The paper tackles the problem of contextual bandit algorithms suffering from unexpected regret when reward models are misspecified, by proposing a simple family of algorithms that adapt to misspecification error by reverting to a safe policy, with regret guarantees that degrade gracefully based on average misspecification level.

Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that gracefully degrade in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do no rely on a master algorithm, and do not require more robust oracles like online or constrained regression oracles (e.g., Foster et al. (2020a); Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes