ML · LG · Oct 23, 2021

Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits

arXiv:2110.12175v2 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses a gap in reinforcement learning for sequential decision-making under partial observability, providing incremental theoretical advancements for researchers in bandit algorithms.

The paper tackles partially observable contextual multi-armed bandits by proposing a Thompson Sampling algorithm. It establishes theoretical guarantees, including regret that scales logarithmically with time and the number of arms and linearly with the dimension, and supports them with numerical analyses.

Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely used policy for bandits is Thompson Sampling, where samples from a data-driven probabilistic belief about unknown parameters are used to select the control actions. For this computationally fast algorithm, performance analyses are available under full context observations. However, little is known about problems in which contexts are not fully observed. We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, and establish theoretical performance guarantees. Technically, we show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension. Further, we establish rates of learning unknown parameters, and provide illustrative numerical analyses.
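To make the sampling step in the abstract concrete, the following is a minimal sketch of Thompson Sampling in a linear bandit where only noisy observations of the contexts are available. It is not the paper's algorithm. The setup is assumed for illustration: standard Gaussian contexts, additive Gaussian observation noise, one parameter vector shared across arms, and a Gaussian belief with unit noise scale. All names (`run_thompson_sampling`, `obs_noise`, `reward_noise`, and the helper functions) are hypothetical.

```python
import math
import random


def cholesky(B):
    """Lower-triangular L with L L^T = B, for symmetric positive definite B."""
    d = len(B)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = B[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L u = b for lower-triangular L."""
    u = []
    for i in range(len(b)):
        u.append((b[i] - sum(L[i][k] * u[k] for k in range(i))) / L[i][i])
    return u


def backward_solve(L, b):
    """Solve L^T v = b for lower-triangular L."""
    d = len(b)
    v = [0.0] * d
    for i in reversed(range(d)):
        v[i] = (b[i] - sum(L[k][i] * v[k] for k in range(i + 1, d))) / L[i][i]
    return v


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def run_thompson_sampling(horizon=2000, n_arms=5, dim=3,
                          obs_noise=0.5, reward_noise=0.1, seed=0):
    rng = random.Random(seed)
    # Assumed ground truth: a unit-norm parameter shared by all arms.
    mu = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(dot(mu, mu))
    mu = [m / norm for m in mu]
    # Gaussian belief stored as precision matrix B and vector f (mean = B^-1 f).
    B = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    f = [0.0] * dim
    regret = 0.0
    for _ in range(horizon):
        # True contexts stay hidden; the policy sees only noisy observations.
        xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_arms)]
        ys = [[x + obs_noise * rng.gauss(0, 1) for x in ctx] for ctx in xs]
        # Draw theta ~ N(B^-1 f, B^-1) using the Cholesky factor of B.
        L = cholesky(B)
        u = forward_solve(L, f)
        z = [rng.gauss(0, 1) for _ in range(dim)]
        theta = backward_solve(L, [ui + zi for ui, zi in zip(u, z)])
        # Act greedily with respect to the sampled parameter.
        arm = max(range(n_arms), key=lambda i: dot(ys[i], theta))
        y = ys[arm]
        reward = dot(xs[arm], mu) + reward_noise * rng.gauss(0, 1)
        # Update the belief by regressing rewards on the observations.
        for i in range(dim):
            f[i] += reward * y[i]
            for j in range(dim):
                B[i][j] += y[i] * y[j]
        # Regret against an oracle that knows mu but also sees only ys;
        # under this Gaussian setup E[x | y] = y / (1 + obs_noise**2).
        best = max(dot(yi, mu) for yi in ys)
        regret += (best - dot(y, mu)) / (1 + obs_noise ** 2)
    L = cholesky(B)
    posterior_mean = backward_solve(L, forward_solve(L, f))
    return regret, posterior_mean, mu


if __name__ == "__main__":
    total_regret, mean, mu = run_thompson_sampling()
    print("cumulative regret:", round(total_regret, 3))
```

In this sketch the posterior mean estimates a rescaled version of the true parameter (shrunk by the observation noise), which still points in the same direction and therefore ranks arms the same way. Storing the belief in precision form lets one Cholesky factorization per round serve both the mean computation and the sampling step.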
