LGMLFeb 3, 2019

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

arXiv:1902.00980v3144 citations
Originality Highly original
AI Analysis

This work addresses efficient and optimal decision-making in changing environments for applications like online advertising and recommendation systems, representing a significant advance over incremental improvements.

The paper tackles the problem of non-stationary contextual bandits by proposing a parameter-free algorithm that achieves dynamic regret of O(min{√(ST), Δ^(1/3)T^(2/3)}), improving upon prior bounds and generalizing results to contextual settings.

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, Δ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches and $Δ$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $Δ$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, Δ^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al, 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes