Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring
This work addresses a theoretical gap in online decision-making for researchers and practitioners in machine learning, though it is incremental as it extends Thompson sampling to a specific variant of partial monitoring.
The paper tackles the problem of sequential learning with limited feedback in stochastic partial monitoring by proposing a novel Thompson-sampling-based algorithm that enables exact sampling from the posterior distribution. It proves that this algorithm achieves a logarithmic expected pseudo-regret of O(log T) for a linearized variant with local observability, marking the first such regret bound for Thompson sampling in partial monitoring and linear bandits.
We investigate finite stochastic partial monitoring, which is a general model for sequential learning with limited feedback. While Thompson sampling is one of the most promising algorithms on a variety of online decision-making problems, its properties for stochastic partial monitoring have not been theoretically investigated, and the existing algorithm relies on a heuristic approximation of the posterior distribution. To mitigate these problems, we present a novel Thompson-sampling-based algorithm, which enables us to exactly sample the target parameter from the posterior distribution. Besides, we prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability. This result is the first regret bound of Thompson sampling for partial monitoring, which also becomes the first logarithmic regret bound of Thompson sampling for linear bandits.