AI LG MLOct 8, 2017

An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

arXiv:1710.02869v27.68 citations

Originality Highly original

AI Analysis

This work addresses a fundamental problem in reinforcement learning for decision-making under uncertainty, offering a novel approach with theoretical guarantees.

The paper tackles the exploration-exploitation trade-off in stochastic, discrete multi-armed bandits by proposing an information-theoretic strategy based on the value of information criterion, achieving optimal logarithmic regret with respect to the number of episodes.

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.

View on arXiv PDF

Similar