LGAIMLSep 28, 2019

Accelerating the Computation of UCB and Related Indices for Reinforcement Learning

arXiv:1909.13158v12 citations
Originality Incremental advance
AI Analysis

This work addresses a computational bottleneck for researchers and practitioners using these algorithms in reinforcement learning, though it is incremental as it builds on existing index-based methods.

The paper tackles the computational inefficiency of computing indices for MDP-UCB and MDP-DMED algorithms in reinforcement learning by deriving methods that reduce the problem to solving two non-linear equations or a single equation, respectively, independent of state space size, and demonstrates computational time savings and regret performance in experiments.

In this paper we derive an efficient method for computing the indices associated with an asymptotically optimal upper confidence bound algorithm (MDP-UCB) of Burnetas and Katehakis (1997) that only requires solving a system of two non-linear equations with two unknowns, irrespective of the cardinality of the state space of the Markovian decision process (MDP). In addition, we develop a similar acceleration for computing the indices for the MDP-Deterministic Minimum Empirical Divergence (MDP-DMED) algorithm developed in Cowan et al. (2019), based on ideas from Honda and Takemura (2011), that involves solving a single equation of one variable. We provide experimental results demonstrating the computational time savings and regret performance of these algorithms. In these comparison we also consider the Optimistic Linear Programming (OLP) algorithm (Tewari and Bartlett, 2008) and a method based on Posterior sampling (MDP-PS).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes