LGFeb 9, 2018

Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

arXiv:1802.03386v134 citations
AI Analysis

This addresses a theoretical gap in online learning for contextual bandits, which is incremental but solves a known challenge.

The paper resolves an open problem by achieving a first-order regret bound for contextual bandits, scaling with the square root of the optimal performance rather than the time horizon.

Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon $T$. The more refined concept of first-order regret bound replaces this with a scaling $\sqrt{L^*}$, which may be much smaller than $\sqrt{T}$. It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes