Adaptive Data Augmentation for Thompson Sampling
This work addresses the theoretical limitations of Thompson Sampling in contextual bandits, which is important for applications like online advertising and recommendation systems, though it appears incremental as it builds on prior Thompson Sampling methods.
The paper tackles the problem of suboptimal regret bounds in Thompson Sampling for linear contextual bandits by proposing a novel estimator with adaptive augmentation and coupling of hypothetical samples, achieving nearly minimax optimal regret and showing robust performance with significant improvement over existing methods in empirical results.
In linear contextual bandits, the objective is to select actions that maximize cumulative rewards, modeled as a linear function with unknown parameters. Although Thompson Sampling performs well empirically, it does not achieve optimal regret bounds. This paper proposes a nearly minimax optimal Thompson Sampling for linear contextual bandits by developing a novel estimator with the adaptive augmentation and coupling of the hypothetical samples that are designed for efficient parameter learning. The proposed estimator accurately predicts rewards for all arms without relying on assumptions for the context distribution. Empirical results show robust performance and significant improvement over existing methods.