LGAIMLOTOct 27, 2013

Generalized Thompson Sampling for Contextual Bandits

arXiv:1310.7163v123 citations
Originality Incremental advance
AI Analysis

This work addresses a theoretical gap for researchers in bandit algorithms, offering a novel framework that generalizes an existing heuristic, though it is incremental in extending prior work.

The paper tackles the problem of theoretical understanding of Thompson Sampling in contextual bandits by proposing Generalized Thompson Sampling, a new family of algorithms that includes Thompson Sampling as a special case and derives general regret bounds applicable to general contextual bandits, quantifying the effect of the prior distribution on regret.

Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interests in theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated to two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes