LG MLJan 21, 2024

Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis

arXiv:2401.11565v27.93 citationsEntropy

Originality Incremental advance

AI Analysis

This work addresses the challenge of decision-making under uncertainty with noisy observations in bandit problems, which is incremental as it extends Thompson sampling to a specific noisy context scenario.

The paper tackles the problem of stochastic contextual linear bandits with noisy contexts, where the agent observes corrupted contexts through an unknown noise channel, and designs a Thompson sampling algorithm to approximate an oracle's policy. It demonstrates Bayesian regret bounds through information-theoretic analysis and shows empirically that the algorithm outperforms baselines, with delayed true contexts leading to lower regret.

We explore a stochastic contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that can approximate" that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context from the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Adopting an information-theoretic analysis, we demonstrate the Bayesian regret of our algorithm concerning the oracle's action policy. We also extend this problem to a scenario where the agent observes the true context with some delay after receiving the reward and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.

View on arXiv PDF

Similar