LG | AI | Jun 29, 2020

Learning and Planning in Average-Reward Markov Decision Processes

arXiv:2006.16318v3 | 85 citations
Originality: Incremental advance
AI Analysis

This paper addresses practical challenges in reinforcement learning for continuing tasks by providing algorithms that are more reliable and easier to use, although its proofs build incrementally on existing techniques.

The authors address learning and planning in average-reward Markov decision processes by introducing off-policy algorithms that need no reference states. The contributions are the first general proven-convergent off-policy model-free control algorithm without reference states, the first proven-convergent off-policy model-free prediction algorithm, and the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset.

We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
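The central idea in the abstract, updating the average-reward estimate with the temporal-difference error, can be illustrated with a minimal tabular sketch. The update form below is an assumption based on the abstract's description, not a transcription of the authors' code; the function name `differential_q_update`, the step-size names `alpha` and `eta`, and the list-of-lists table layout are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def differential_q_update(
    q: List[List[float]],
    r_bar: float,
    s: int,
    a: int,
    r: float,
    s_next: int,
    alpha: float,
    eta: float,
) -> Tuple[float, float]:
    """Apply one tabular update to q in place; return (td_error, new_r_bar).

    Illustrative sketch only: names and exact form are assumptions.
    """
    # TD error measured relative to the current average-reward estimate.
    delta = r - r_bar + max(q[s_next]) - q[s][a]
    # Move the action-value estimate along the TD error.
    q[s][a] += alpha * delta
    # Assumed form: the average-reward estimate is also driven by the
    # TD error, with its own relative step size eta, so no reference
    # state is needed to anchor the values.
    r_bar += eta * alpha * delta
    return delta, r_bar
```

In a one-state, one-action problem with a constant reward of 2, repeated calls drive `r_bar` toward 2 under this sketch, because the value terms cancel in the TD error and only the gap between the reward and `r_bar` remains.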

Code Implementations: 2 repos