AILGMLOct 8, 2017

An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

arXiv:1710.02869v28 citations
Originality Highly original
AI Analysis

This work addresses a fundamental problem in reinforcement learning for decision-making under uncertainty, offering a novel approach with theoretical guarantees.

The paper tackles the exploration-exploitation trade-off in stochastic, discrete multi-armed bandits by proposing an information-theoretic strategy based on the value of information criterion, achieving optimal logarithmic regret with respect to the number of episodes.

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes