LGFeb 7, 2012

Contextual Bandit Learning with Predictable Rewards

arXiv:1202.1334v2101 citations
AI Analysis

This work addresses the contextual bandit learning problem for reinforcement learning practitioners, but it is incremental as it builds on existing assumptions and algorithms.

The paper tackles the contextual bandit problem under a realizability assumption, showing that a new algorithm, Regressor Elimination, achieves regret similar to the agnostic setting, and proves a lower bound indicating no algorithm can do better in worst-case scenarios, but it achieves constant regret for specific reward distributions.

Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action and receives a reward based on the action and context. We consider this problem under a realizability assumption: there exists a function in a (known) function class, always capable of predicting the expected reward, given the action and context. Under this assumption, we show three things. We present a new algorithm---Regressor Elimination--- with a regret similar to the agnostic setting (i.e. in the absence of realizability assumption). We prove a new lower bound showing no algorithm can achieve superior performance in the worst case even with the realizability assumption. However, we do show that for any set of policies (mapping contexts to actions), there is a distribution over rewards (given context) such that our new algorithm has constant regret unlike the previous approaches.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes