LGAIJun 16, 2021

Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?

arXiv:2106.08771v2
Originality Highly original
AI Analysis

This addresses computational efficiency for researchers and practitioners in multi-armed bandit problems, offering a scalable solution with proven bounds, though it is incremental as it adapts existing algorithms to a specific structure.

The paper tackles the scalability of reinforcement learning algorithms for Markovian bandits, showing that a Bayesian variant (MB-PSRL) achieves near-optimal regret of $ ilde{O}(S\sqrt{nK})$ and linear runtime in the number of bandits, outperforming non-Bayesian methods in experiments.

We study learning algorithms for the classical Markovian bandit problem with discount. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the problem structure. These variants are called MB-PSRL and MB-UCRL2. While the regret bound and runtime of vanilla implementations of PSRL and UCRL2 are exponential in the number of bandits, we show that the episodic regret of MB-PSRL and MB-UCRL2 is $\tilde{O}(S\sqrt{nK})$ where $K$ is the number of episodes, $n$ is the number of bandits and $S$ is the number of states of each bandit (the exact bound in S, n and K is given in the paper). Up to a factor $\sqrt S$, this matches the lower bound of $Ω(\sqrt{SnK})$ that we also derive in the paper. MB-PSRL is also computationally efficient: its runtime is linear in the number of bandits. We further show that this linear runtime cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2 or UCBVI to Markovian bandit problems. Finally, we perform numerical experiments that confirm that MB-PSRL outperforms other existing algorithms in practice, both in terms of regret and of computation time.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes