Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies
This work addresses adaptive policy selection in reinforcement learning, a problem of interest to both researchers and practitioners. Its contribution is incremental: it builds on existing methods and evaluates them through comparative analysis.
The paper compares the performance of the classic UCB policy with two alternatives, the newly developed MDP-DMED and the posterior-sampling-based MDP-PS, for reinforcement learning in Markov decision processes with unknown transition probabilities. It finds that the alternatives offer competitive or improved results in specific scenarios.
In this paper we consider the basic version of Reinforcement Learning (RL), which involves computing optimal data-driven (adaptive) policies for Markovian decision processes with unknown transition probabilities. We provide a brief survey of the state of the art in the area, and we compare the performance of the classic UCB policy of \cc{bkmdp97} with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and with a method based on posterior sampling (MDP-PS).