MLAILGMay 23, 2018

Analysis of Thompson Sampling for Graphical Bandits Without the Graphs

arXiv:1805.08930v117 citations
Originality Incremental advance
AI Analysis

This addresses a theoretical gap in online decision-making under partial feedback for researchers in bandit algorithms, with incremental improvements in regret bounds for directed settings.

The paper tackles the problem of multi-armed bandits with graph feedback where the graph is unknown and time-varying, showing that original Thompson Sampling achieves near-optimal regret of $ ilde{O}\left(\sqrt{eta_0(G)T} ight)$ for undirected graphs, and proposes a variant for directed graphs to achieve similar optimality.

We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected, the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}\left(\sqrt{β_0(G)T}\right)$ over time horizon $T$, where $β_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound of Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling, that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require the knowledge of the feedback graphs at any time.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes